From d91b246400c5d3c7ee5836e68a080bba637596a3 Mon Sep 17 00:00:00 2001 From: Kshitiz Lohia Date: Tue, 25 Jul 2023 17:56:10 +0530 Subject: [PATCH 1/3] added feature store notebooks --- .../feature_store_querying.ipynb | 5537 +++++++++++++++++ .../feature_store_quickstart.ipynb | 1940 ++++++ .../feature_store_schema_evolution.ipynb | 3546 +++++++++++ .../feature_store_spark_magic.ipynb | 1017 +++ 4 files changed, 12040 insertions(+) create mode 100644 notebook_examples/feature_store_querying.ipynb create mode 100644 notebook_examples/feature_store_quickstart.ipynb create mode 100644 notebook_examples/feature_store_schema_evolution.ipynb create mode 100644 notebook_examples/feature_store_spark_magic.ipynb diff --git a/notebook_examples/feature_store_querying.ipynb b/notebook_examples/feature_store_querying.ipynb new file mode 100644 index 00000000..f72d6772 --- /dev/null +++ b/notebook_examples/feature_store_querying.ipynb @@ -0,0 +1,5537 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "7e04d02d", + "metadata": { + "pycharm": { + "name": "#%% raw\n" + } + }, + "source": [ + "qweews@notebook{feature_store-querying.ipynb,\n", + " title: Using feature store for feature querying using pandas like interface for query and join,\n", + " summary: Feature store quickstart guide to perform feature querying using pandas like interface for query and join.,\n", + " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " keywords: feature store, querying,\n", + " license: Universal Permissive License v 1.0\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d325ddb", + "metadata": { + "ExecuteTime": { + "end_time": "2023-05-24T08:26:08.572567Z", + "start_time": "2023-05-24T08:26:08.328013Z" + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", + "\n", + "!odsc conda install --uri 
https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/service-conda-packs-fs/o/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "544cf0fe", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "eff8a822", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "Oracle Data Science service sample notebook.\n", + "\n", + "Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", + "\n", + "***\n", + "\n", + "# Feature store handling querying operations\n", + "

by the Oracle Cloud Infrastructure Data Science Service.

\n", + "\n", + "---\n", + "# Overview:\n", + "---\n", + "Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training-serving skew all lead to increased model development time and worse model performance. A feature store is well positioned to solve many of these problems because it provides a centralised way to transform and access data for training and serving, and helps define a standardised pipeline for ingesting and querying data. This notebook demonstrates how to use feature store within a long-lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster.\n", + "\n", + "Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n", + "\n", + "## Contents:\n", + "\n", + "- 1. Introduction\n", + "- 2. Pre-requisites\n", + " - 2.1. Policies\n", + " - 2.2. Authentication\n", + " - 2.3. Variables\n", + "- 3. Feature store querying\n", + " - 3.1. Exploration of data in feature store\n", + " - 3.2. Load feature groups\n", + " - 3.3. Explore feature groups\n", + " - 3.4. Select subset of features\n", + " - 3.5. Filter feature groups\n", + " - 3.6. Apply joins on feature group\n", + " - 3.7. Create dataset from multiple or one feature group\n", + " - 3.8. Free form SQL query\n", + "- 4. References\n", + "\n", + "---\n", + "\n", + "**Important:**\n", + "\n", + "Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = \"\"` would become `database_name = \"production\"`.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "208425ef", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 1. 
Introduction\n", + "\n", + "Oracle feature store is a stack-based solution that is deployed in the customer enclave using OCI Resource Manager. Customers can stand up the service with infrastructure in their own tenancy. The service consists of APIs that are deployed in the customer tenancy using Resource Manager.\n", + "\n", + "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "\n", + "\n", + "* **Feature Vector**: A set of feature values for any one primary/identifier key. For example, all or a subset of the features of customer id ‘2536’ can be called one feature vector.\n", + "\n", + "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", + "\n", + "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc.\n", + "\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering.\n", + "\n", + "* **Feature Group Job**: A feature group job is the execution instance of a feature group. 
Each feature group job will include validation results and statistics results.\n", + "\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", + "\n", + "* **Dataset Job**: A dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results." + ] + }, + { + "cell_type": "markdown", + "id": "0bb56df6", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 2. Pre-requisites\n", + "\n", + "Data Flow Sessions are accessible through the following conda environment:\n", + "\n", + "* **PySpark 3.2, Feature store 1.0 and Data Flow 1.0 (fs_pyspark32_p38_cpu_v1)**\n", + "\n", + "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore ID of the Hive metastore is tied to the feature store construct of the feature store service.\n" + ] + }, + { + "cell_type": "markdown", + "id": "5669e712", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### `spark-defaults.conf`\n", + "\n", + "The `spark-defaults.conf` file is used to define the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually. 
However, the `odsc data-catalog config` command-line tool is ideal for setting up the file because it gathers information about your environment and uses that to build the file.\n", + "\n", + "The `odsc data-catalog config` command-line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command-line option is needed because settings have default values, or they take values from your notebook session environment. The following are common parameters that you may need to override.\n", + "\n", + "The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is selected with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If `--authentication` isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.\n", + "\n", + "Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.\n", + "\n", + "The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory in which to write the file.\n", + "\n", + "You need to determine what settings are appropriate for your configuration. 
However, the following works for most configurations and is run in a terminal window.\n", + "\n", + "```bash\n", + "odsc data-catalog config --authentication resource_principal --metastore \n", + "```\n", + "For more assistance, use the following command in a terminal window:\n", + "\n", + "```bash\n", + "odsc data-catalog config --help\n", + "```\n", + "\n", + "\n", + "### Session Setup\n", + "\n", + "The notebook makes connections to the Data Catalog metastore and Object Storage. In the next cell, specify the bucket URI to act as the data warehouse. Use the `warehouse_uri` variable with the `oci://@/` format. Update the variable `metastore_id` with the OCID of the Data Catalog metastore." + ] + }, + { + "cell_type": "markdown", + "id": "e0977c6c", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 2.1. Policies\n", + "This section covers the creation of dynamic groups and policies needed to use the service.\n", + "\n", + "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)\n", + "* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)\n", + "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n", + "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)" + ] + }, + { + "cell_type": "markdown", + "id": "455ddd75", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 2.2. Authentication\n", + "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook cluster.
\n", + "To set up authentication, use ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "964842e8", + "metadata": { + "ExecuteTime": { + "start_time": "2023-05-24T08:26:08.577504Z" + }, + "is_executing": true, + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import ads\n", + "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"service_endpoint\": \"\"})" + ] + }, + { + "cell_type": "markdown", + "id": "3eeb7367", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 2.3. Variables\n", + "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for offline feature store." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "8471ee05", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "compartment_id = \"\"\n", + "metastore_id = \"\"" + ] + }, + { + "cell_type": "markdown", + "id": "4bcfeb4c", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 3. Feature group querying\n", + "By default, the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) and [Deequ](https://github.com/awslabs/deequ) libraries. The joining functionality is heavily inspired by the APIs used by Pandas to merge, join, or filter DataFrames. The APIs allow you to specify which features to select from which feature group, how to join them, and which features to use in join conditions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "b46d9ca9", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import warnings\n", + "warnings.filterwarnings(\"ignore\", message=\"iteritems is deprecated\")\n", + "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "ef297a89", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/ads/model/deployment/model_deployment.py:48: DeprecationWarning: The `ads.model.deployment.model_deployment_properties` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeploymentInfrastructure` and `ModelDeploymentRuntime` classes in `ads.model.deployment` module for configuring model deployment. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", + " from .model_deployment_properties import ModelDeploymentProperties\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/ads/model/deployment/__init__.py:7: DeprecationWarning: The `ads.model.deployment.model_deployer` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeployment` class in `ads.model.deployment` module for initializing and deploying model deployment. 
Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", + " from .model_deployer import ModelDeployer\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/__init__.py:44: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " LooseVersion(pyarrow.__version__) >= LooseVersion(\"2.0.0\")\n", + "\n", + "WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/frame.py:62: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) >= LooseVersion(\"0.24\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/frame.py:81: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:85: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:191: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/series.py:89: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/groupby.py:50: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) >= LooseVersion(\"1.3.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/fs/__init__.py:4: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", + "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. 
See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", + " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/fs/opener/__init__.py:6: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs.opener')`.\n", + "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", + " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pkg_resources/__init__.py:2349: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", + "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", + " declare_namespace(parent)\n", + "\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "from ads.feature_store.feature_store import FeatureStore\n", + "from ads.feature_store.feature_group import FeatureGroup\n", + "from ads.feature_store.model_details import ModelDetails\n", + "from ads.feature_store.dataset import Dataset\n", + "from ads.feature_store.common.enums import DatasetIngestionMode\n", + "\n", + "from ads.feature_store.feature_group_expectation import ExpectationType\n", + "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", + "from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar" + ] + }, + { + "cell_type": "markdown", + "id": "d01c13f1", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.1. 
Exploration of data in feature store" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "b8d4a31d", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/tmp/ipykernel_4349/906484602.py:1: DtypeWarning: Columns (7,8) have mixed types. Specify dtype option on import or set low_memory=False.\n", + " flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   YEAR  MONTH  DAY  DAY_OF_WEEK  AIRLINE  FLIGHT_NUMBER  ORIGIN_AIRPORT  DESTINATION_AIRPORT
0  2015  1      1    4            AS       98             ANC             SEA
1  2015  1      1    4            AA       2336           LAX             PBI
2  2015  1      1    4            US       840            SFO             CLT
3  2015  1      1    4            AA       258            LAX             MIA
4  2015  1      1    4            AS       135            SEA             ANC
\n", + "
" + ], + "text/plain": [ + " YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT \\\n", + "0 2015 1 1 4 AS 98 ANC \n", + "1 2015 1 1 4 AA 2336 LAX \n", + "2 2015 1 1 4 US 840 SFO \n", + "3 2015 1 1 4 AA 258 LAX \n", + "4 2015 1 1 4 AS 135 SEA \n", + "\n", + " DESTINATION_AIRPORT \n", + "0 SEA \n", + "1 PBI \n", + "2 CLT \n", + "3 MIA \n", + "4 ANC " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", + "flights_df = flights_df.head(100)\n", + "flights_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "0263f6a7", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   IATA_CODE  AIRPORT                              CITY         STATE  COUNTRY  LATITUDE  LONGITUDE
0  ABE        Lehigh Valley International Airport  Allentown    PA     USA      40.65236  -75.44040
1  ABI        Abilene Regional Airport             Abilene      TX     USA      32.41132  -99.68190
2  ABQ        Albuquerque International Sunport    Albuquerque  NM     USA      35.04022  -106.60919
3  ABR        Aberdeen Regional Airport            Aberdeen     SD     USA      45.44906  -98.42183
4  ABY        Southwest Georgia Regional Airport   Albany       GA     USA      31.53552  -84.19447
\n", + "
" + ], + "text/plain": [ + " IATA_CODE AIRPORT CITY STATE COUNTRY \\\n", + "0 ABE Lehigh Valley International Airport Allentown PA USA \n", + "1 ABI Abilene Regional Airport Abilene TX USA \n", + "2 ABQ Albuquerque International Sunport Albuquerque NM USA \n", + "3 ABR Aberdeen Regional Airport Aberdeen SD USA \n", + "4 ABY Southwest Georgia Regional Airport Albany GA USA \n", + "\n", + " LATITUDE LONGITUDE \n", + "0 40.65236 -75.44040 \n", + "1 32.41132 -99.68190 \n", + "2 35.04022 -106.60919 \n", + "3 45.44906 -98.42183 \n", + "4 31.53552 -84.19447 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")\n", + "airports_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "bfac65f4", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   IATA_CODE  AIRLINE
0  UA         United Air Lines Inc.
1  AA         American Airlines Inc.
2  US         US Airways Inc.
3  F9         Frontier Airlines Inc.
4  B6         JetBlue Airways
\n", + "
" + ], + "text/plain": [ + " IATA_CODE AIRLINE\n", + "0 UA United Air Lines Inc.\n", + "1 AA American Airlines Inc.\n", + "2 US US Airways Inc.\n", + "3 F9 Frontier Airlines Inc.\n", + "4 B6 JetBlue Airways" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "airlines_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airlines.csv\")\n", + "airlines_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "88a21cff", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.2. Create feature store logical entities" + ] + }, + { + "cell_type": "markdown", + "id": "789489e5", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "#### 3.2.1 Feature Store\n", + "Feature store is the top level entity for feature store service" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "b490664b", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "feature_store_resource = (\n", + " FeatureStore().\n", + " with_description(\"Data consisting of flights\").\n", + " with_compartment_id(compartment_id).\n", + " with_display_name(\"flights details\").\n", + " with_offline_config(metastore_id=metastore_id)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b9bb4ef6", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "##### Create Feature Store\n", + "\n", + "Call the ```.create()``` method of the Feature store instance to create a feature store." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b70ade05", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: featurestore\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " description: Data consisting of flights\n", + " displayName: flights details\n", + " id: 751D665EB6AE7360928F15705F9F0F48\n", + " offlineConfig:\n", + " metastoreId: ocid1.datacatalogmetastore.oc1.iad.amaaaaaanif7xwiaavhd2liaebamr3tbjzio3uw2lxuteoa5ejsfvhqufbsa\n", + "type: featureStore" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_store = feature_store_resource.create()\n", + "feature_store" + ] + }, + { + "cell_type": "markdown", + "id": "4e2bc9f0", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "#### 3.2.2 Entity\n", + "An entity is a group of semantically related features." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "a75bf559", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: entity\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " description: description for flight details\n", + " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", + " id: 843E320A28F319748425787F04BCD3B8\n", + " name: Flight details2\n", + "type: entity" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "entity = feature_store.create_entity(\n", + " display_name=\"Flight details2\",\n", + " description=\"description for flight details\"\n", + ")\n", + "entity" + ] + }, + { + "cell_type": "markdown", + "id": "6e4a0991", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "#### 3.2.3 Feature group\n", + "A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource." + ] + }, + { + "cell_type": "markdown", + "id": "b59e6d7d", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "##### Flights Feature Group\n", + "\n", + "Create feature group for flights\n", + "\n", + "
\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "9e0665c2", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting default log level to \"WARN\".\n", + "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", + "2023/07/14 04:29:29 NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n" + ] + } + ], + "source": [ + "feature_group_flights = (\n", + " FeatureGroup()\n", + " .with_feature_store_id(feature_store.id)\n", + " .with_primary_keys([\"FLIGHT_NUMBER\"])\n", + " .with_name(\"flights_feature_group\")\n", + " .with_entity_id(entity.id)\n", + " .with_compartment_id(compartment_id)\n", + " .with_schema_details_from_dataframe(flights_df)\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "753119fc", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " entityId: 843E320A28F319748425787F04BCD3B8\n", + " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", + " id: C24E858807F4EBA22BF14C08B9A6E2DD\n", + " inputFeatureDetails:\n", + " - featureType: LONG\n", + " name: YEAR\n", + " orderNumber: 1\n", + " - featureType: LONG\n", + " name: MONTH\n", + " orderNumber: 2\n", + " - featureType: LONG\n", + " name: DAY\n", + " orderNumber: 3\n", + " - featureType: LONG\n", + " name: DAY_OF_WEEK\n", + " orderNumber: 4\n", + " - featureType: STRING\n", + " name: AIRLINE\n", + " orderNumber: 5\n", + " - featureType: LONG\n", + " name: FLIGHT_NUMBER\n", + " orderNumber: 6\n", + " - featureType: STRING\n", + " name: ORIGIN_AIRPORT\n", + " orderNumber: 7\n", + " - featureType: STRING\n", + " name: DESTINATION_AIRPORT\n", + " orderNumber: 8\n", + " isInferSchema: true\n", + " name: flights_feature_group\n", + " primaryKeys:\n", + " items:\n", + " - name: FLIGHT_NUMBER\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: featureGroup" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "feature_group_flights.create()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "7c7b8e9b", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "%3\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "flights details\n", + "Feature Store\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "Flight details2\n", + "Entity\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "\n", + "\n", + "C24E858807F4EBA22BF14C08B9A6E2DD\n", + "\n", + "flights_feature_group\n", + "Feature Group\n", + "C24E858807F4EBA22BF14C08B9A6E2DD\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8->C24E858807F4EBA22BF14C08B9A6E2DD\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "feature_group_flights.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "8d28daf4", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Hive Session ID = 59994193-ab1d-4749-8d21-17cc661a95c6\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:mlm_insights.builder:validating required components\n", + "INFO:mlm_insights.builder:required components validated\n", + "INFO:mlm_insights.builder.usage:Activating Minimal Insights Usage\n", + "INFO:mlm_insights.builder:Generating Runner object\n", + "INFO:mlm_insights.builder:Generating workflow request\n", + "INFO:mlm_insights.workflow:Fetching engine object\n", + "INFO:mlm_insights.workflow:Returning native engine object\n", + "INFO:mlm_insights.builder:Running Fugue Workflow\n", + "INFO:mlm_insights.workflow:Executing Fugue Workflow\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. 
This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. 
Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "[Stage 8:=============================> (1 + 1) / 2]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9399bf0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=2015.0, minimum=2015.0, maximum=2015.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9399d70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9399cf0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9203930>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9399e70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9399db0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': 
FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef939a230>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef939a570>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef939a470>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef939a970>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=4.0, minimum=4.0, maximum=4.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef939abf0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef939aaf0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9399630>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef93a9030>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef93a9530>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1711.5100000000002, minimum=17.0, maximum=7419.0, central_moments=[1.0, 0.0, 3509091.8299000002, 10157914842.877602, 55483811382672.16]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef93a97b0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch 
object at 0x7f8ef93a96b0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef93a9af0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef93a9bb0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef93a9cb0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9416830>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 2015.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [2015.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [2015.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 2015.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 201500.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 2015.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + 
"INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting 
quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input 
data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 4.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [4.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [4.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 4.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 400.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + 
"INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 4.0, 'q2': 4.0, 'q3': 4.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 4.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='AA', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='B6', estimate=12, lower_bound=12, upper_bound=12), FrequentItemEstimate(value='NK', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='UA', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='AS', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='DL', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='US', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='OO', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='EV', estimate=7, lower_bound=7, upper_bound=7), FrequentItemEstimate(value='HA', estimate=5, lower_bound=5, upper_bound=5)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 88, 'percentage': 88.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['AA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 12.000000327825557\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 1.5452988004009884\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 1873.257011170651\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 17.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 1905.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 7402.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 
4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999999, 0.04999999999999999, 0.08999999999999997, 0.07000000000000006, 0.040000000000000036, 0.039999999999999925, 0.040000000000000036, 0.06999999999999995, 0.010000000000000009, 0.010000000000000009, 0.0, 0.0, 0.0, 0.0, 0.010000000000000009, 0.010000000000000009, 0.010000000000000009, 0.0, 0.030000000000000027, 0.039999999999999925, 0.010000000000000009, 0.0, 0.010000000000000009, 0.0, 0.0, 0.0, 0.020000000000000018, 0.010000000000000009]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 3509091.8299000002\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 
4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 7419.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 96.00002264977122 in Distinct count SFC, upper bound = 96.00481585896145, lower bound = 96.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 96.00002264977122\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 171151.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1711.5100000000002\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.5058509315336428\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='ANC', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='LAS', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='SJU', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='LAX', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='SFO', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='PHX', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='SEA', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='HNL', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='ORD', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='PDX', estimate=3, lower_bound=3, upper_bound=3)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 
'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 56, 'percentage': 56.00000000000001}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['ANC']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 44.000004698833415\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='MIA', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='IAH', 
estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='MSP', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='SEA', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='ATL', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DFW', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='MCO', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DEN', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='PHX', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='CLT', estimate=4, lower_bound=4, upper_bound=4)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 71, 'percentage': 71.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['MIA', 'IAH']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc 
from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 29.00000201662398\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 100.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ C24E858807F4EBA22BF14C08B9A6E2DD │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "feature_group_flights.materialise(flights_df)" + ] + }, + { + "cell_type": "markdown", + "id": "41d796d5", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "##### Airport Feature Group\n", + "\n", + "Create a feature group for the airports data." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "4c247dde", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{\"meta\": {}, \"expectation_type\": \"expect_column_values_to_be_between\", \"kwargs\": {\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0}}" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "expectation_suite_airports = ExpectationSuite(\n", + " expectation_suite_name=\"test_airports_df\"\n", + ")\n", + "expectation_suite_airports.add_expectation(\n", + " ExpectationConfiguration(\n", + " 
expectation_type=\"expect_column_values_to_not_be_null\",\n", + " kwargs={\"column\": \"IATA_CODE\"},\n", + " )\n", + ")\n", + "expectation_suite_airports.add_expectation(\n", + " ExpectationConfiguration(\n", + " expectation_type=\"expect_column_values_to_be_between\",\n", + " kwargs={\"column\": \"LATITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n", + " )\n", + ")\n", + "\n", + "expectation_suite_airports.add_expectation(\n", + " ExpectationConfiguration(\n", + " expectation_type=\"expect_column_values_to_be_between\",\n", + " kwargs={\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "81863e53", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n" + ] + } + ], + "source": [ + "feature_group_airports = (\n", + " FeatureGroup()\n", + " .with_feature_store_id(feature_store.id)\n", + " .with_primary_keys([\"IATA_CODE\"])\n", + " .with_name(\"airport_feature_group\")\n", + " .with_entity_id(entity.id)\n", + " .with_compartment_id(compartment_id)\n", + " .with_schema_details_from_dataframe(airports_df)\n", + " .with_expectation_suite(\n", + " expectation_suite=expectation_suite_airports,\n", + " expectation_type=ExpectationType.LENIENT,\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "e1920d4c", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " entityId: 843E320A28F319748425787F04BCD3B8\n", + " expectationDetails:\n", + " createRuleDetails:\n", + " - arguments:\n", + " column: IATA_CODE\n", + " levelType: ERROR\n", + " name: Rule-0\n", + " ruleType: expect_column_values_to_not_be_null\n", + " - arguments:\n", + " column: LATITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-1\n", + " ruleType: expect_column_values_to_be_between\n", + " - arguments:\n", + " column: LONGITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-2\n", + " ruleType: expect_column_values_to_be_between\n", + " expectationType: LENIENT\n", + " name: test_airports_df\n", + " validationEngineType: GREAT_EXPECTATIONS\n", + " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", + " id: C1771CFDA79A082BB9FB85D9E5FCB192\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 
1\n", + " - featureType: STRING\n", + " name: AIRPORT\n", + " orderNumber: 2\n", + " - featureType: STRING\n", + " name: CITY\n", + " orderNumber: 3\n", + " - featureType: STRING\n", + " name: STATE\n", + " orderNumber: 4\n", + " - featureType: STRING\n", + " name: COUNTRY\n", + " orderNumber: 5\n", + " - featureType: DOUBLE\n", + " name: LATITUDE\n", + " orderNumber: 6\n", + " - featureType: DOUBLE\n", + " name: LONGITUDE\n", + " orderNumber: 7\n", + " isInferSchema: true\n", + " name: airport_feature_group\n", + " primaryKeys:\n", + " items:\n", + " - name: IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: featureGroup" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airports.create()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "7a78eaa2", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "db29009398704583b95af2e91841296e", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8efbe47830>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9584930>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9584270>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9584870>), '4cd1d3704778a196571a6c83581854cc': 
DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef958f230>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef958f670>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef958f630>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef958fab0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef958fa70>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9596130>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9596330>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9596230>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9596570>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef95967b0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 
0x7f8ef95966b0>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount 
metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), 
FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: 
{'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + 
"INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated 
IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from 
sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.386920000000003\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.0, 0.003134796238244514, 0.003134796238244514, 0.009404388714733541, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.0031347962382445096, 0.006269592476489033, 0.006269592476489019, 0.00940438871473355, 0.006269592476489033, 0.0, 0.018808777429467072, 
0.05642633228840126, 0.040752351097178674, 0.05015673981191224, 0.03448275862068967, 0.043887147335423204, 0.037617554858934144, 0.09090909090909094, 0.08463949843260188, 0.08777429467084641, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.00940438871473348, 0.006269592476489061]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [0, 1, 1, 3, 3, 5, 2, 1, 2, 2, 3, 2, 0, 6, 18, 13, 16, 11, 14, 12, 29, 27, 28, 32, 30, 26, 18, 9, 3, 2]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -110.94103, 'q2': -93.40307, 'q3': -82.55411}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ C1771CFDA79A082BB9FB85D9E5FCB192 │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "feature_group_airports.materialise(airports_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "44277176", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "%3\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "flights details\n", + "Feature Store\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "Flight details2\n", + "Entity\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "\n", + "\n", + "C1771CFDA79A082BB9FB85D9E5FCB192\n", + "\n", + "airport_feature_group\n", + "Feature Group\n", + "C1771CFDA79A082BB9FB85D9E5FCB192\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8->C1771CFDA79A082BB9FB85D9E5FCB192\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "feature_group_airports.show()" + ] + }, + { + "cell_type": "markdown", + "id": "d842551d", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "##### Airlines Feature Group\n", + "\n", + "Create 
a feature group for airlines.\n", + "\n", + "
\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "31a33a56", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "f3c7a4c2", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{\"meta\": {}, \"expectation_type\": \"expect_column_values_to_not_be_null\", \"kwargs\": {\"column\": \"IATA_CODE\"}}" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "expectation_suite_airlines = ExpectationSuite(\n", + " expectation_suite_name=\"test_airlines_df\"\n", + ")\n", + "expectation_suite_airlines.add_expectation(\n", + " ExpectationConfiguration(\n", + " expectation_type=\"expect_column_values_to_not_be_null\",\n", + " kwargs={\"column\": \"IATA_CODE\"},\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "1b9ad0dc", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n" + ] + } + ], + "source": [ + "feature_group_airlines = (\n", + " FeatureGroup()\n", + " .with_feature_store_id(feature_store.id)\n", + " .with_primary_keys([\"IATA_CODE\"])\n", + " .with_name(\"airlines_feature_group\")\n", + " .with_entity_id(entity.id)\n", + " .with_compartment_id(compartment_id)\n", + " .with_schema_details_from_dataframe(airlines_df)\n", + " .with_expectation_suite(\n", + " expectation_suite=expectation_suite_airlines,\n", + " expectation_type=ExpectationType.STRICT,\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "35cea00f", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " entityId: 843E320A28F319748425787F04BCD3B8\n", + " expectationDetails:\n", + " createRuleDetails:\n", + " - arguments:\n", + " column: IATA_CODE\n", + " levelType: ERROR\n", + " name: Rule-0\n", + " ruleType: expect_column_values_to_not_be_null\n", + " expectationType: STRICT\n", + " name: test_airlines_df\n", + " validationEngineType: GREAT_EXPECTATIONS\n", + " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", + " id: 4E21D2D878A101E8804837CAD6499FD9\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: AIRLINE\n", + " orderNumber: 2\n", + " isInferSchema: true\n", + " name: airlines_feature_group\n", + " primaryKeys:\n", + " items:\n", + " - name: IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: featureGroup" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "feature_group_airlines.create()" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "ae7c7ff9", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t1 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "971d31d06c77444eadef7392d6903b71", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/6 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef956b430>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef95cdc30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef95cdb70>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 14.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='UA', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='AA', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='NK', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='VX', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='OO', estimate=1, lower_bound=1, upper_bound=1), 
FrequentItemEstimate(value='WN', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='US', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='DL', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='AS', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='B6', estimate=1, lower_bound=1, upper_bound=1)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, upper bound = 14.000699461533127, lower bound = 14.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['UA', 'AA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, 
upper bound = 14.000699461533127, lower bound = 14.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 14\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 14.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='Skywest Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='American Eagle Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Frontier Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Atlantic Southeast Airlines', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Southwest Airlines Co.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Hawaiian Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='American Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Virgin America', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Spirit Air Lines', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='JetBlue Airways', estimate=1, lower_bound=1, upper_bound=1)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, upper bound = 14.000699461533127, lower bound = 14.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['Skywest Airlines Inc.', 'American Eagle Airlines Inc.']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, upper bound = 14.000699461533127, lower bound = 14.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 14\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 14.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ 4E21D2D878A101E8804837CAD6499FD9 │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + 
"source": [ + "feature_group_airlines.materialise(airlines_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "1c4dcf81", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "%3\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "flights details\n", + "Feature Store\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "Flight details2\n", + "Entity\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "\n", + "\n", + "4E21D2D878A101E8804837CAD6499FD9\n", + "\n", + "airlines_feature_group\n", + "Feature Group\n", + "4E21D2D878A101E8804837CAD6499FD9\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8->4E21D2D878A101E8804837CAD6499FD9\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "feature_group_airlines.show()" + ] + }, + { + "cell_type": "markdown", + "id": "cb4e05d6", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.3. Explore feature groups" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "a00444ad", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>name</th><th>type</th><th>feature_group_id</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>YEAR</td><td>LONG</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "    <tr><th>1</th><td>MONTH</td><td>LONG</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "    <tr><th>2</th><td>DAY</td><td>LONG</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "    <tr><th>3</th><td>DAY_OF_WEEK</td><td>LONG</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "    <tr><th>4</th><td>AIRLINE</td><td>STRING</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "    <tr><th>5</th><td>FLIGHT_NUMBER</td><td>LONG</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "    <tr><th>6</th><td>ORIGIN_AIRPORT</td><td>STRING</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "    <tr><th>7</th><td>DESTINATION_AIRPORT</td><td>STRING</td><td>C24E858807F4EBA22BF14C08B9A6E2DD</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " name type feature_group_id\n", + "0 YEAR LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", + "1 MONTH LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", + "2 DAY LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", + "3 DAY_OF_WEEK LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", + "4 AIRLINE STRING C24E858807F4EBA22BF14C08B9A6E2DD\n", + "5 FLIGHT_NUMBER LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", + "6 ORIGIN_AIRPORT STRING C24E858807F4EBA22BF14C08B9A6E2DD\n", + "7 DESTINATION_AIRPORT STRING C24E858807F4EBA22BF14C08B9A6E2DD" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_flights.get_features_df()" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "1e492391", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>name</th><th>type</th><th>feature_group_id</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>IATA_CODE</td><td>STRING</td><td>C1771CFDA79A082BB9FB85D9E5FCB192</td></tr>\n",
+ "    <tr><th>1</th><td>AIRPORT</td><td>STRING</td><td>C1771CFDA79A082BB9FB85D9E5FCB192</td></tr>\n",
+ "    <tr><th>2</th><td>CITY</td><td>STRING</td><td>C1771CFDA79A082BB9FB85D9E5FCB192</td></tr>\n",
+ "    <tr><th>3</th><td>STATE</td><td>STRING</td><td>C1771CFDA79A082BB9FB85D9E5FCB192</td></tr>\n",
+ "    <tr><th>4</th><td>COUNTRY</td><td>STRING</td><td>C1771CFDA79A082BB9FB85D9E5FCB192</td></tr>\n",
+ "    <tr><th>5</th><td>LATITUDE</td><td>DOUBLE</td><td>C1771CFDA79A082BB9FB85D9E5FCB192</td></tr>\n",
+ "    <tr><th>6</th><td>LONGITUDE</td><td>DOUBLE</td><td>C1771CFDA79A082BB9FB85D9E5FCB192</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " name type feature_group_id\n", + "0 IATA_CODE STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", + "1 AIRPORT STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", + "2 CITY STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", + "3 STATE STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", + "4 COUNTRY STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", + "5 LATITUDE DOUBLE C1771CFDA79A082BB9FB85D9E5FCB192\n", + "6 LONGITUDE DOUBLE C1771CFDA79A082BB9FB85D9E5FCB192" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airports.get_features_df()" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "dbde287a", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   name       type    feature_group_id
0  IATA_CODE  STRING  4E21D2D878A101E8804837CAD6499FD9
1  AIRLINE    STRING  4E21D2D878A101E8804837CAD6499FD9
\n", + "
" + ], + "text/plain": [ + " name type feature_group_id\n", + "0 IATA_CODE STRING 4E21D2D878A101E8804837CAD6499FD9\n", + "1 AIRLINE STRING 4E21D2D878A101E8804837CAD6499FD9" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airlines.get_features_df()" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "9c15fb2e", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 35:> (0 + 1) / 1]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+\n", + "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+\n", + "|2015| 1| 1| 4| B6| 1030| BQN| MCO|\n", + "|2015| 1| 1| 4| B6| 262| SJU| BOS|\n", + "|2015| 1| 1| 4| B6| 2134| SJU| MCO|\n", + "|2015| 1| 1| 4| B6| 730| BQN| MCO|\n", + "|2015| 1| 1| 4| B6| 768| PSE| MCO|\n", + "|2015| 1| 1| 4| B6| 2276| SJU| BDL|\n", + "|2015| 1| 1| 4| US| 602| ORD| PHX|\n", + "|2015| 1| 1| 4| AS| 695| GEG| SEA|\n", + "|2015| 1| 1| 4| HA| 102| HNL| ITO|\n", + "|2015| 1| 1| 4| OO| 5467| ONT| SFO|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+\n", + "only showing top 10 rows\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "feature_group_flights.select().show()" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "1fa80478", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 38:> (0 + 1) / 1]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"+---------+--------------------+-------------+-----+-------+--------+----------+\n", + "|IATA_CODE| AIRPORT| CITY|STATE|COUNTRY|LATITUDE| LONGITUDE|\n", + "+---------+--------------------+-------------+-----+-------+--------+----------+\n", + "| ABE|Lehigh Valley Int...| Allentown| PA| USA|40.65236| -75.4404|\n", + "| ABI|Abilene Regional ...| Abilene| TX| USA|32.41132| -99.6819|\n", + "| ABQ|Albuquerque Inter...| Albuquerque| NM| USA|35.04022|-106.60919|\n", + "| ABR|Aberdeen Regional...| Aberdeen| SD| USA|45.44906| -98.42183|\n", + "| ABY|Southwest Georgia...| Albany| GA| USA|31.53552| -84.19447|\n", + "| ACK|Nantucket Memoria...| Nantucket| MA| USA|41.25305| -70.06018|\n", + "| ACT|Waco Regional Air...| Waco| TX| USA|31.61129| -97.23052|\n", + "| ACV| Arcata Airport|Arcata/Eureka| CA| USA|40.97812|-124.10862|\n", + "| ACY|Atlantic City Int...|Atlantic City| NJ| USA|39.45758| -74.57717|\n", + "| ADK| Adak Airport| Adak| AK| USA|51.87796|-176.64603|\n", + "+---------+--------------------+-------------+-----+-------+--------+----------+\n", + "only showing top 10 rows\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "feature_group_airports.select().show()" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "dbb37e5c", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+--------------------+\n", + "|IATA_CODE| AIRLINE|\n", + "+---------+--------------------+\n", + "| NK| Spirit Air Lines|\n", + "| WN|Southwest Airline...|\n", + "| DL|Delta Air Lines Inc.|\n", + "| EV|Atlantic Southeas...|\n", + "| HA|Hawaiian Airlines...|\n", + "| MQ|American Eagle Ai...|\n", + "| VX| Virgin America|\n", + "| UA|United Air Lines ...|\n", + "| AA|American Airlines...|\n", + "| US| US Airways Inc.|\n", + 
"+---------+--------------------+\n", + "only showing top 10 rows\n", + "\n" + ] + } + ], + "source": [ + "feature_group_airlines.select().show()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "e67ea0f5", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
                      IATA_CODE  AIRLINE
Count                 {'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0}  {'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0}
TopKFrequentElements  {'value': [{'value': 'UA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'NK', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'VX', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'OO', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'WN', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'US', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'DL', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AS', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'B6', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]}  {'value': [{'value': 'Skywest Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Eagle Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Frontier Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Atlantic Southeast Airlines', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Southwest Airlines Co.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Hawaiian Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Virgin America', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Spirit Air Lines', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'JetBlue Airways', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]}
TypeMetric            {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}  {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}
DuplicateCount        {'count': 0, 'percentage': 0.0}  {'count': 0, 'percentage': 0.0}
Mode                  {'value': ['UA', 'AA']}  {'value': ['Skywest Airlines Inc.', 'American Eagle Airlines Inc.']}
DistinctCount         {'value': 14}  {'value': 14}
\n", + "
" + ], + "text/plain": [ + " IATA_CODE \\\n", + "Count {'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "TopKFrequentElements {'value': [{'value': 'UA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'NK', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'VX', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'OO', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'WN', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'US', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'DL', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AS', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'B6', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]} \n", + "TypeMetric {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "DuplicateCount {'count': 0, 'percentage': 0.0} \n", + "Mode {'value': ['UA', 'AA']} \n", + "DistinctCount {'value': 14} \n", + "\n", + " AIRLINE \n", + "Count {'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "TopKFrequentElements {'value': [{'value': 'Skywest Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Eagle Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Frontier Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Atlantic Southeast Airlines', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Southwest Airlines Co.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Hawaiian Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Virgin America', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Spirit Air Lines', 
'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'JetBlue Airways', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]} \n", + "TypeMetric {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "DuplicateCount {'count': 0, 'percentage': 0.0} \n", + "Mode {'value': ['Skywest Airlines Inc.', 'American Eagle Airlines Inc.']} \n", + "DistinctCount {'value': 14} " + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airlines.get_statistics().to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "583db211", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
YEARMONTHDAYDAY_OF_WEEKAIRLINEFLIGHT_NUMBERORIGIN_AIRPORTDESTINATION_AIRPORT
Skewness{'value': None}{'value': None}{'value': None}{'value': None}NaN{'value': 1.545298800400988}NaNNaN
StandardDeviation{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 1873.257011170651}NaNNaN
Min{'value': 2015.0}{'value': 1.0}{'value': 1.0}{'value': 4.0}NaN{'value': 17.0}NaNNaN
IsConstantFeature{'value': True}{'value': True}{'value': True}{'value': True}NaN{'value': False}NaNNaN
IQR{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 1905.0}NaNNaN
Range{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 7402.0}NaNNaN
ProbabilityDistribution{'bins': [2015.0], 'density': [1.0]}{'bins': [1.0], 'density': [1.0]}{'bins': [1.0], 'density': [1.0]}{'bins': [4.0], 'density': [1.0]}NaN{'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999902, 0.049999999999999004, 0.08999999999999901, 0.07, 0.04, 0.039999999999999, 0.04, 0.06999999999999901, 0.01, 0.01, 0.0, 0.0, 0.0, 0.0, 0.01, 0.01, 0.01, 0.0, 0.030000000000000002, 0.039999999999999, 0.01, 0.0, 0.01, 0.0, 0.0, 0.0, 0.02, 0.01]}NaNNaN
Variance{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 3509091.8299000002}NaNNaN
TypeMetric{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}
FrequencyDistribution{'bins': [2015.0], 'frequency': [100]}{'bins': [1.0], 'frequency': [100]}{'bins': [1.0], 'frequency': [100]}{'bins': [4.0], 'frequency': [100]}NaN{'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]}NaNNaN
Count{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}
Max{'value': 2015.0}{'value': 1.0}{'value': 1.0}{'value': 4.0}NaN{'value': 7419.0}NaNNaN
DistinctCount{'value': 1}{'value': 1}{'value': 1}{'value': 1}{'value': 12}{'value': 96}{'value': 44}{'value': 29}
Sum{'value': 201500.0}{'value': 100.0}{'value': 100.0}{'value': 400.0}NaN{'value': 171151.0}NaNNaN
IsQuasiConstantFeature{'value': True}{'value': True}{'value': True}{'value': True}NaN{'value': False}NaNNaN
Quartiles{'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0}{'q1': 1.0, 'q2': 1.0, 'q3': 1.0}{'q1': 1.0, 'q2': 1.0, 'q3': 1.0}{'q1': 4.0, 'q2': 4.0, 'q3': 4.0}NaN{'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0}NaNNaN
Mean{'value': 2015.0}{'value': 1.0}{'value': 1.0}{'value': 4.0}NaN{'value': 1711.5100000000002}NaNNaN
Kurtosis{'value': None}{'value': None}{'value': None}{'value': None}NaN{'value': 1.505850931533642}NaNNaN
TopKFrequentElementsNaNNaNNaNNaN{'value': [{'value': 'AA', 'estimate': 14, 'lower_bound': 14, 'upper_bound': 14}, {'value': 'B6', 'estimate': 12, 'lower_bound': 12, 'upper_bound': 12}, {'value': 'NK', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'UA', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'AS', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'DL', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'US', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'OO', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'EV', 'estimate': 7, 'lower_bound': 7, 'upper_bound': 7}, {'value': 'HA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}]}NaN{'value': [{'value': 'ANC', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'LAS', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SJU', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'LAX', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'SFO', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'PHX', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'SEA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'HNL', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'ORD', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'PDX', 'estimate': 3, 'lower_bound': 3, 'upper_bound': 3}]}{'value': [{'value': 'MIA', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'IAH', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'MSP', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SEA', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'ATL', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DFW', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'MCO', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DEN', 'estimate': 5, 'lower_bound': 5, 
'upper_bound': 5}, {'value': 'PHX', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'CLT', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}]}
DuplicateCountNaNNaNNaNNaN{'count': 88, 'percentage': 88.0}NaN{'count': 56, 'percentage': 56.00000000000001}{'count': 71, 'percentage': 71.0}
ModeNaNNaNNaNNaN{'value': ['AA']}NaN{'value': ['ANC']}{'value': ['MIA', 'IAH']}
\n", + "
" + ], + "text/plain": [ + " YEAR \\\n", + "Skewness {'value': None} \n", + "StandardDeviation {'value': 0.0} \n", + "Min {'value': 2015.0} \n", + "IsConstantFeature {'value': True} \n", + "IQR {'value': 0.0} \n", + "Range {'value': 0.0} \n", + "ProbabilityDistribution {'bins': [2015.0], 'density': [1.0]} \n", + "Variance {'value': 0.0} \n", + "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution {'bins': [2015.0], 'frequency': [100]} \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max {'value': 2015.0} \n", + "DistinctCount {'value': 1} \n", + "Sum {'value': 201500.0} \n", + "IsQuasiConstantFeature {'value': True} \n", + "Quartiles {'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0} \n", + "Mean {'value': 2015.0} \n", + "Kurtosis {'value': None} \n", + "TopKFrequentElements NaN \n", + "DuplicateCount NaN \n", + "Mode NaN \n", + "\n", + " MONTH \\\n", + "Skewness {'value': None} \n", + "StandardDeviation {'value': 0.0} \n", + "Min {'value': 1.0} \n", + "IsConstantFeature {'value': True} \n", + "IQR {'value': 0.0} \n", + "Range {'value': 0.0} \n", + "ProbabilityDistribution {'bins': [1.0], 'density': [1.0]} \n", + "Variance {'value': 0.0} \n", + "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution {'bins': [1.0], 'frequency': [100]} \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max {'value': 1.0} \n", + "DistinctCount {'value': 1} \n", + "Sum {'value': 100.0} \n", + "IsQuasiConstantFeature {'value': True} \n", + "Quartiles {'q1': 1.0, 'q2': 1.0, 'q3': 1.0} \n", + "Mean {'value': 1.0} \n", + "Kurtosis {'value': None} \n", + "TopKFrequentElements NaN \n", + "DuplicateCount NaN \n", + "Mode NaN \n", + "\n", + " DAY \\\n", + "Skewness {'value': None} \n", + "StandardDeviation {'value': 0.0} \n", + 
"Min {'value': 1.0} \n", + "IsConstantFeature {'value': True} \n", + "IQR {'value': 0.0} \n", + "Range {'value': 0.0} \n", + "ProbabilityDistribution {'bins': [1.0], 'density': [1.0]} \n", + "Variance {'value': 0.0} \n", + "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution {'bins': [1.0], 'frequency': [100]} \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max {'value': 1.0} \n", + "DistinctCount {'value': 1} \n", + "Sum {'value': 100.0} \n", + "IsQuasiConstantFeature {'value': True} \n", + "Quartiles {'q1': 1.0, 'q2': 1.0, 'q3': 1.0} \n", + "Mean {'value': 1.0} \n", + "Kurtosis {'value': None} \n", + "TopKFrequentElements NaN \n", + "DuplicateCount NaN \n", + "Mode NaN \n", + "\n", + " DAY_OF_WEEK \\\n", + "Skewness {'value': None} \n", + "StandardDeviation {'value': 0.0} \n", + "Min {'value': 4.0} \n", + "IsConstantFeature {'value': True} \n", + "IQR {'value': 0.0} \n", + "Range {'value': 0.0} \n", + "ProbabilityDistribution {'bins': [4.0], 'density': [1.0]} \n", + "Variance {'value': 0.0} \n", + "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution {'bins': [4.0], 'frequency': [100]} \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max {'value': 4.0} \n", + "DistinctCount {'value': 1} \n", + "Sum {'value': 400.0} \n", + "IsQuasiConstantFeature {'value': True} \n", + "Quartiles {'q1': 4.0, 'q2': 4.0, 'q3': 4.0} \n", + "Mean {'value': 4.0} \n", + "Kurtosis {'value': None} \n", + "TopKFrequentElements NaN \n", + "DuplicateCount NaN \n", + "Mode NaN \n", + "\n", + " AIRLINE \\\n", + "Skewness NaN \n", + "StandardDeviation NaN \n", + "Min NaN \n", + "IsConstantFeature NaN \n", + "IQR NaN \n", + "Range NaN \n", + "ProbabilityDistribution NaN \n", + "Variance NaN \n", + "TypeMetric 
{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution NaN \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max NaN \n", + "DistinctCount {'value': 12} \n", + "Sum NaN \n", + "IsQuasiConstantFeature NaN \n", + "Quartiles NaN \n", + "Mean NaN \n", + "Kurtosis NaN \n", + "TopKFrequentElements {'value': [{'value': 'AA', 'estimate': 14, 'lower_bound': 14, 'upper_bound': 14}, {'value': 'B6', 'estimate': 12, 'lower_bound': 12, 'upper_bound': 12}, {'value': 'NK', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'UA', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'AS', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'DL', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'US', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'OO', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'EV', 'estimate': 7, 'lower_bound': 7, 'upper_bound': 7}, {'value': 'HA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}]} \n", + "DuplicateCount {'count': 88, 'percentage': 88.0} \n", + "Mode {'value': ['AA']} \n", + "\n", + " FLIGHT_NUMBER \\\n", + "Skewness {'value': 1.545298800400988} \n", + "StandardDeviation {'value': 1873.257011170651} \n", + "Min {'value': 17.0} \n", + "IsConstantFeature {'value': False} \n", + "IQR {'value': 1905.0} \n", + "Range {'value': 7402.0} \n", + "ProbabilityDistribution {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 
6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999902, 0.049999999999999004, 0.08999999999999901, 0.07, 0.04, 0.039999999999999, 0.04, 0.06999999999999901, 0.01, 0.01, 0.0, 0.0, 0.0, 0.0, 0.01, 0.01, 0.01, 0.0, 0.030000000000000002, 0.039999999999999, 0.01, 0.0, 0.01, 0.0, 0.0, 0.0, 0.02, 0.01]} \n", + "Variance {'value': 3509091.8299000002} \n", + "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]} \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max {'value': 7419.0} \n", + "DistinctCount {'value': 96} \n", + "Sum {'value': 171151.0} \n", + "IsQuasiConstantFeature {'value': False} \n", + "Quartiles {'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0} \n", + "Mean {'value': 1711.5100000000002} \n", + "Kurtosis {'value': 1.505850931533642} \n", + "TopKFrequentElements NaN \n", + "DuplicateCount NaN \n", + "Mode NaN \n", + "\n", + " ORIGIN_AIRPORT \\\n", + "Skewness NaN \n", + "StandardDeviation NaN \n", + "Min NaN \n", + "IsConstantFeature NaN \n", + "IQR NaN \n", + "Range NaN \n", + "ProbabilityDistribution NaN \n", + "Variance NaN \n", + "TypeMetric {'string_type_count': 100, 
'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution NaN \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max NaN \n", + "DistinctCount {'value': 44} \n", + "Sum NaN \n", + "IsQuasiConstantFeature NaN \n", + "Quartiles NaN \n", + "Mean NaN \n", + "Kurtosis NaN \n", + "TopKFrequentElements {'value': [{'value': 'ANC', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'LAS', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SJU', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'LAX', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'SFO', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'PHX', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'SEA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'HNL', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'ORD', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'PDX', 'estimate': 3, 'lower_bound': 3, 'upper_bound': 3}]} \n", + "DuplicateCount {'count': 56, 'percentage': 56.00000000000001} \n", + "Mode {'value': ['ANC']} \n", + "\n", + " DESTINATION_AIRPORT \n", + "Skewness NaN \n", + "StandardDeviation NaN \n", + "Min NaN \n", + "IsConstantFeature NaN \n", + "IQR NaN \n", + "Range NaN \n", + "ProbabilityDistribution NaN \n", + "Variance NaN \n", + "TypeMetric {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", + "FrequencyDistribution NaN \n", + "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", + "Max NaN \n", + "DistinctCount {'value': 29} \n", + "Sum NaN \n", + "IsQuasiConstantFeature NaN \n", + "Quartiles NaN \n", + "Mean NaN \n", + "Kurtosis NaN \n", + "TopKFrequentElements {'value': [{'value': 'MIA', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'IAH', 'estimate': 10, 'lower_bound': 10, 
'upper_bound': 10}, {'value': 'MSP', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SEA', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'ATL', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DFW', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'MCO', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DEN', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'PHX', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'CLT', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}]} \n", + "DuplicateCount {'count': 71, 'percentage': 71.0} \n", + "Mode {'value': ['MIA', 'IAH']} " + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_flights.get_statistics().to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "7cfc56fe", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
                                                            0
results                                                     [{'exception_info': {'raised_exception': False, 'exception_traceback': None, 'exception_message': None}, 'meta': {}, 'result': {'element_count': 14, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}, 'expectation_config': {'meta': {}, 'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'IATA_CODE', 'batch_id': '90bbaf1a6a4ae45a238e05e0d240a033'}}, 'success': True}]
success                                                     True
meta.great_expectations_version                             0.15.39
meta.expectation_suite_name                                 airlines_feature_group
meta.run_id.run_time                                        2023-07-14T04:30:58.945832+00:00
meta.run_id.run_name                                        None
meta.batch_markers.ge_load_time                             20230714T043058.944828Z
meta.active_batch_definition.datasource_name                feature-ingestion-pipeline
meta.active_batch_definition.data_connector_name            feature-ingestion-pipeline
meta.active_batch_definition.data_asset_name                feature-ingestion-pipeline
meta.active_batch_definition.batch_identifiers.ge_batch_id  3b3f551a-21ff-11ee-9023-0242ac130002
meta.validation_time                                        20230714T043058.945751Z
meta.checkpoint_name                                        None
statistics.evaluated_expectations                           1
statistics.successful_expectations                          1
statistics.unsuccessful_expectations                        0
statistics.success_percent                                  100.0
\n", + "
" + ], + "text/plain": [ + " 0\n", + "results [{'exception_info': {'raised_exception': False, 'exception_traceback': None, 'exception_message': None}, 'meta': {}, 'result': {'element_count': 14, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}, 'expectation_config': {'meta': {}, 'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'IATA_CODE', 'batch_id': '90bbaf1a6a4ae45a238e05e0d240a033'}}, 'success': True}]\n", + "success True\n", + "meta.great_expectations_version 0.15.39\n", + "meta.expectation_suite_name airlines_feature_group\n", + "meta.run_id.run_time 2023-07-14T04:30:58.945832+00:00\n", + "meta.run_id.run_name None\n", + "meta.batch_markers.ge_load_time 20230714T043058.944828Z\n", + "meta.active_batch_definition.datasource_name feature-ingestion-pipeline\n", + "meta.active_batch_definition.data_connector_name feature-ingestion-pipeline\n", + "meta.active_batch_definition.data_asset_name feature-ingestion-pipeline\n", + "meta.active_batch_definition.batch_identifiers.ge_batch_id 3b3f551a-21ff-11ee-9023-0242ac130002\n", + "meta.validation_time 20230714T043058.945751Z\n", + "meta.checkpoint_name None\n", + "statistics.evaluated_expectations 1\n", + "statistics.successful_expectations 1\n", + "statistics.unsuccessful_expectations 0\n", + "statistics.success_percent 100.0" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airlines.get_validation_output().to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "a84c1e68", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.4. 
Select subset of features" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "d81ddcbb", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+\n", + "|IATA_CODE|\n", + "+---------+\n", + "| NK|\n", + "| WN|\n", + "| DL|\n", + "| EV|\n", + "| HA|\n", + "| MQ|\n", + "| VX|\n", + "| UA|\n", + "| AA|\n", + "| US|\n", + "+---------+\n", + "only showing top 10 rows\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "feature_group_airlines.select(['IATA_CODE']).show()" + ] + }, + { + "cell_type": "markdown", + "id": "2416e2a2", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.5. Filter feature groups" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "19267e79", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 50:> (0 + 1) / 1]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+--------------------+\n", + "|IATA_CODE| AIRLINE|\n", + "+---------+--------------------+\n", + "| EV|Atlantic Southeas...|\n", + "+---------+--------------------+\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "feature_group_airlines.filter(feature_group_airlines.IATA_CODE == \"EV\").show()" + ] + }, + { + "cell_type": "markdown", + "id": "22da4132", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.6. Apply joins on feature group\n", + "As in Pandas, if the feature has the same name on both feature groups, then you can use the `on=[]` parameter. 
If they have different names, then you can use the `left_on=[]` and `right_on=[]` parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "212c1750", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", + "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|IATA_CODE|AIRPORT|CITY|STATE|COUNTRY|LATITUDE|LONGITUDE|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", + "|2015| 1| 1| 4| B6| 1030| BQN| MCO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 262| SJU| BOS| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 2134| SJU| MCO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 730| BQN| MCO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 768| PSE| MCO| null| null|null| null| null| null| null|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "from ads.feature_store.common.enums import JoinType\n", + "\n", + "query = (\n", + " feature_group_flights.select()\n", + " .join(feature_group_airlines.select(), left_on=['ORIGIN_AIRPORT'], right_on=['IATA_CODE'], join_type=JoinType.LEFT)\n", + " .join(feature_group_airports.select(), left_on=['AIRLINE'], right_on=['IATA_CODE'], join_type=JoinType.LEFT)\n", + ")\n", + "query.show(5)" + ] + }, + { + "cell_type": "code", + "execution_count": 
37, + "id": "7690e2a4", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK, fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT, fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE LATITUDE, fg_1.LONGITUDE LONGITUDE FROM `843E320A28F319748425787F04BCD3B8`.flights_feature_group fg_2 LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0 ON fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE'" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "query.to_string()" + ] + }, + { + "cell_type": "markdown", + "id": "a249cbe0", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.7. Create dataset\n", + "A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference.\n", + "\n", + "
\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "ad857582", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "dataset = (\n", + " Dataset()\n", + " .with_description(\"Combined dataset for flights\")\n", + " .with_compartment_id(compartment_id)\n", + " .with_name(\"flights_dataset\")\n", + " .with_entity_id(entity.id)\n", + " .with_feature_store_id(feature_store.id)\n", + " .with_query(query.to_string())\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c61e568b", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "##### Create Dataset\n", + "\n", + "Call the ```.create()``` method of the Dataset instance to create a dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "ca7becdf", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: Dataset\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " description: Combined dataset for flights\n", + " entityId: 843E320A28F319748425787F04BCD3B8\n", + " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", + " id: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " name: flights_dataset\n", + " query: SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK,\n", + " fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT,\n", + " fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT\n", + " AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE\n", + " LATITUDE, fg_1.LONGITUDE LONGITUDE FROM `843E320A28F319748425787F04BCD3B8`.flights_feature_group\n", + " fg_2 LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0\n", + " ON fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group\n", + " 
fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: dataset" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset.create()" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "597e3dd1", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:mlm_insights.builder:validating required components\n", + "INFO:mlm_insights.builder:required components validated\n", + "INFO:mlm_insights.builder.usage:Activating Minimal Insights Usage\n", + "INFO:mlm_insights.builder:Generating Runner object\n", + "INFO:mlm_insights.builder:Generating workflow request\n", + "INFO:mlm_insights.workflow:Fetching engine object\n", + "INFO:mlm_insights.workflow:Returning native engine object\n", + "INFO:mlm_insights.builder:Running Fugue Workflow\n", + "INFO:mlm_insights.workflow:Executing Fugue Workflow\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. 
Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", + " return _methods._mean(a, axis=axis, dtype=dtype,\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", + " ret = ret.dtype.type(ret / rcount)\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", + " return _methods._mean(a, axis=axis, dtype=dtype,\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", + " ret = ret.dtype.type(ret / rcount)\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. 
Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", + " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", + " return _methods._mean(a, axis=axis, dtype=dtype,\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", + " ret = ret.dtype.type(ret / rcount)\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", + " return _methods._mean(a, axis=axis, dtype=dtype,\n", + "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", + " ret = ret.dtype.type(ret / rcount)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': 
FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c31570>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=2015.0, minimum=2015.0, maximum=2015.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c31270>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9589030>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9590f30>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef95613f0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef95842f0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef95909b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9590e30>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9c1f1b0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f330>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=4.0, minimum=4.0, maximum=4.0, central_moments=[1.0, 0.0, 
0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fc30>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9c1f8f0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f2b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fdb0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f3b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1711.51, minimum=17.0, maximum=7419.0, central_moments=[1.0, 0.0, 3509091.8299000002, 10157914842.877602, 55483811382672.16]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1f470>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9c1fd70>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f630>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fb30>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f9b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fbb0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c23ab0>), 
'4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c23470>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c235f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c43670>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c435f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c43c70>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c43bf0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c43f70>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c3f4b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c3ff70>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c3f630>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=0.0, mean=nan, minimum=nan, maximum=nan, central_moments=[nan, nan, nan, nan, nan]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9502870>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9526770>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from 
{'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9526d70>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=0.0, mean=nan, minimum=nan, maximum=nan, central_moments=[nan, nan, nan, nan, nan]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9526df0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9526070>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 2015.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + 
"INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [2015.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [2015.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 2015.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 201500.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 2015.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: 
{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", 
+ "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", 
+ "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 
1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 4.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [4.0], 'density': [1.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 
100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [4.0], 'frequency': [100]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 4.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 400.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 4.0, 'q2': 4.0, 'q3': 4.0}\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 4.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='AA', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='B6', estimate=12, lower_bound=12, upper_bound=12), FrequentItemEstimate(value='UA', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='AS', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='NK', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='DL', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='US', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='OO', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='EV', estimate=7, lower_bound=7, upper_bound=7), FrequentItemEstimate(value='HA', estimate=5, lower_bound=5, upper_bound=5)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + 
"INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 88, 'percentage': 88.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['AA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 12.000000327825557\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 1.5452988004009884\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 
1873.257011170651\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 17.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 1905.0\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 7402.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 
5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999999, 0.04999999999999999, 0.08999999999999997, 0.07000000000000006, 0.040000000000000036, 0.039999999999999925, 0.040000000000000036, 0.06999999999999995, 0.010000000000000009, 0.010000000000000009, 0.0, 0.0, 0.0, 0.0, 0.010000000000000009, 0.010000000000000009, 0.010000000000000009, 0.0, 0.030000000000000027, 0.039999999999999925, 0.010000000000000009, 0.0, 0.010000000000000009, 0.0, 0.0, 0.0, 0.020000000000000018, 0.010000000000000009]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 3509091.8299000002\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, 
value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 7419.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 96.00002264977122 in Distinct count SFC, upper bound = 96.00481585896145, lower bound = 96.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 96.00002264977122\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 171151.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1711.51\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) 
sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.5058509315336428\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='ANC', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='LAS', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='SJU', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='LAX', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='SFO', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='PHX', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='SEA', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='HNL', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='ORD', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='PDX', estimate=3, lower_bound=3, upper_bound=3)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 56, 'percentage': 56.00000000000001}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['ANC']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 44.000004698833415\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='IAH', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='MIA', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='SEA', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='MSP', 
estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='ATL', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DFW', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='MCO', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DEN', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='CLT', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='PHX', estimate=4, lower_bound=4, upper_bound=4)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 71, 'percentage': 71.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['IAH', 'MIA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting 
total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 29.00000201662398\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: 
[]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + 
"INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.metrics:Calculated Mean metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.metrics:Calculated Variance metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 0.0\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, 
config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 100.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ 6881C3E17FC9BBB02934BB7B6B9068D1 │ DATASET │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "dataset.materialise()" + ] + }, + { + "cell_type": "markdown", + "id": "2b775d67", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "### Interoperability with model" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "f14c80c4", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: Dataset\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " description: Combined dataset for flights\n", + " entityId: 843E320A28F319748425787F04BCD3B8\n", + " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", + " id: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " jobId: f8a347ca-db9a-4ba6-adbf-c3a5f0c61441\n", + " modelDetails:\n", + " items:\n", + " - ocid1.modelcatalog.oc1.unique_ocid\n", + " name: flights_dataset\n", + " outputFeatureDetails:\n", + " items:\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: YEAR\n", + " - datasetId: 
6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: MONTH\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: DAY\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: DAY_OF_WEEK\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: AIRLINE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: FLIGHT_NUMBER\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: ORIGIN_AIRPORT\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: DESTINATION_AIRPORT\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: IATA_CODE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: AIRPORT\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: CITY\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: STATE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: COUNTRY\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: DOUBLE\n", + " name: LATITUDE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: DOUBLE\n", + " name: LONGITUDE\n", + " query: SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK,\n", + " fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT,\n", + " fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT\n", + " AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE\n", + " LATITUDE, fg_1.LONGITUDE LONGITUDE FROM `843E320A28F319748425787F04BCD3B8`.flights_feature_group\n", + " fg_2 LEFT JOIN 
`843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0\n", + " ON fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group\n", + " fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: dataset" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_details = ModelDetails().with_items([\"ocid1.modelcatalog.oc1.unique_ocid\"])\n", + "dataset.with_model_details(model_details)" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "8b5d9b08", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: Dataset\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " description: Combined dataset for flights\n", + " entityId: 843E320A28F319748425787F04BCD3B8\n", + " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", + " id: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " jobId: f8a347ca-db9a-4ba6-adbf-c3a5f0c61441\n", + " modelDetails:\n", + " items:\n", + " - ocid1.modelcatalog.oc1.unique_ocid\n", + " name: flights_dataset\n", + " outputFeatureDetails:\n", + " items:\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: YEAR\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: MONTH\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: DAY\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: DAY_OF_WEEK\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: AIRLINE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: LONG\n", + " name: FLIGHT_NUMBER\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: 
ORIGIN_AIRPORT\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: DESTINATION_AIRPORT\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: IATA_CODE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: AIRPORT\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: CITY\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: STATE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: STRING\n", + " name: COUNTRY\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: DOUBLE\n", + " name: LATITUDE\n", + " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", + " featureType: DOUBLE\n", + " name: LONGITUDE\n", + " query: SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK,\n", + " fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT,\n", + " fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT\n", + " AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE\n", + " LATITUDE, fg_1.LONGITUDE LONGITUDE FROM `843E320A28F319748425787F04BCD3B8`.flights_feature_group\n", + " fg_2 LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0\n", + " ON fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group\n", + " fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: dataset" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset.update()" + ] + }, + { + "cell_type": "markdown", + "id": "ba077d02", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "##### Visualise lineage\n", + "\n", + "Use the ```.show()``` 
method on the Dataset instance to visualize the lineage of the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "ad764d69", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "%3\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "flights details\n", + "Feature Store\n", + "751D665EB6AE7360928F15705F9F0F48\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "Flight details2\n", + "Entity\n", + "843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", + "\n", + "\n", + "\n", + "\n", + "4E21D2D878A101E8804837CAD6499FD9\n", + "\n", + "airlines_feature_group\n", + "Feature Group\n", + "4E21D2D878A101E8804837CAD6499FD9\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8->4E21D2D878A101E8804837CAD6499FD9\n", + "\n", + "\n", + "\n", + "\n", + "6881C3E17FC9BBB02934BB7B6B9068D1\n", + "\n", + "flights_dataset\n", + "Dataset\n", + "6881C3E17FC9BBB02934BB7B6B9068D1\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8->6881C3E17FC9BBB02934BB7B6B9068D1\n", + "\n", + "\n", + "\n", + "\n", + "C1771CFDA79A082BB9FB85D9E5FCB192\n", + "\n", + "airport_feature_group\n", + "Feature Group\n", + "C1771CFDA79A082BB9FB85D9E5FCB192\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8->C1771CFDA79A082BB9FB85D9E5FCB192\n", + "\n", + "\n", + "\n", + "\n", + "C24E858807F4EBA22BF14C08B9A6E2DD\n", + "\n", + "flights_feature_group\n", + "Feature Group\n", + "C24E858807F4EBA22BF14C08B9A6E2DD\n", + "\n", + "\n", + "843E320A28F319748425787F04BCD3B8->C24E858807F4EBA22BF14C08B9A6E2DD\n", + "\n", + "\n", + "\n", + "\n", + "4E21D2D878A101E8804837CAD6499FD9->6881C3E17FC9BBB02934BB7B6B9068D1\n", + "\n", + "\n", + "\n", + "\n", + "ocid1.modelcatalog.oc1.unique_ocid\n", + "\n", + " \n", + "Model\n", + "ocid1.modelcatalog.oc1.unique_ocid\n", + "\n", + "\n", + 
"6881C3E17FC9BBB02934BB7B6B9068D1->ocid1.modelcatalog.oc1.unique_ocid\n", + "\n", + "\n", + "\n", + "\n", + "C1771CFDA79A082BB9FB85D9E5FCB192->6881C3E17FC9BBB02934BB7B6B9068D1\n", + "\n", + "\n", + "\n", + "\n", + "C24E858807F4EBA22BF14C08B9A6E2DD->6881C3E17FC9BBB02934BB7B6B9068D1\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "dataset.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "5b46e716", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n", + "|format| id| name|description| location| createdAt| lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|\n", + "+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n", + "| delta|7b4825ef-5a04-4fb...|843e320a28f319748...| null|oci://default-sto...|2023-07-14 04:31:...|2023-07-14 04:32:11| []| 2| 9038| {}| 1| 2|\n", + "+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n", + "\n" + ] + } + ], + "source": [ + "dataset.profile().show()" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "13e18a51", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", + "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|IATA_CODE|AIRPORT|CITY|STATE|COUNTRY|LATITUDE|LONGITUDE|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", + "|2015| 1| 1| 4| B6| 1030| BQN| MCO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 262| SJU| BOS| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 2134| SJU| MCO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 730| BQN| MCO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 768| PSE| MCO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| B6| 2276| SJU| BDL| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| US| 602| ORD| PHX| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| AS| 695| GEG| SEA| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| HA| 102| HNL| ITO| null| null|null| null| null| null| null|\n", + "|2015| 1| 1| 4| OO| 5467| ONT| SFO| null| null|null| null| null| null| null|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", + "\n" + ] + } + ], + "source": [ + "dataset.preview().show()" + ] + }, + { + "cell_type": "markdown", + "id": "2f784a25", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.8 Freeform SQL query\n", + "Feature store provides a way to query feature store using free flow query. User need to mention `entity id` as the database name and `feature group name` as the table name to query feature store. 
This functionality can be useful if you need to express more complex queries for your use case" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "79bdaf43", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "entity_id = entity.id\n", + "\n", + "sql = (f\"SELECT flights_feature_group.*, airport_feature_group.IATA_CODE \"\n", + " f\"FROM `{entity_id}`.flights_feature_group flights_feature_group \"\n", + " f\"LEFT JOIN `{entity_id}`.airport_feature_group airport_feature_group \"\n", + " f\"ON flights_feature_group.ORIGIN_AIRPORT=airport_feature_group.IATA_CODE\")" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "8b02df32", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+\n", + "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|IATA_CODE|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+\n", + "|2015| 1| 1| 4| B6| 1030| BQN| MCO| BQN|\n", + "|2015| 1| 1| 4| B6| 262| SJU| BOS| SJU|\n", + "|2015| 1| 1| 4| B6| 2134| SJU| MCO| SJU|\n", + "|2015| 1| 1| 4| B6| 730| BQN| MCO| BQN|\n", + "|2015| 1| 1| 4| B6| 768| PSE| MCO| PSE|\n", + "|2015| 1| 1| 4| B6| 2276| SJU| BDL| SJU|\n", + "|2015| 1| 1| 4| US| 602| ORD| PHX| ORD|\n", + "|2015| 1| 1| 4| AS| 695| GEG| SEA| GEG|\n", + "|2015| 1| 1| 4| HA| 102| HNL| ITO| HNL|\n", + "|2015| 1| 1| 4| OO| 5467| ONT| SFO| ONT|\n", + "|2015| 1| 1| 4| HA| 108| HNL| KOA| HNL|\n", + "|2015| 1| 1| 4| AS| 730| ANC| SEA| ANC|\n", + "|2015| 1| 1| 4| HA| 206| HNL| OGG| HNL|\n", + "|2015| 1| 1| 4| UA| 1500| ORD| IAH| ORD|\n", + "|2015| 1| 1| 4| AA| 1323| MCO| MIA| MCO|\n", + "|2015| 1| 1| 4| NK| 103| BOS| MYR| BOS|\n", + 
"|2015| 1| 1| 4| OO| 7404| HIB| MSP| HIB|\n", + "|2015| 1| 1| 4| OO| 7419| ABR| MSP| ABR|\n", + "|2015| 1| 1| 4| OO| 5254| MAF| IAH| MAF|\n", + "|2015| 1| 1| 4| US| 480| SEA| PHX| SEA|\n", + "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+\n", + "only showing top 20 rows\n", + "\n" + ] + } + ], + "source": [ + "feature_store.sql(sql).show()" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "6d72aefa", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "feature_store_yaml = \"\"\"\n", + "apiVersion: v1\n", + "kind: featureStore\n", + "spec:\n", + " displayName: Flights feature store\n", + " compartmentId: \"\"\n", + " offlineConfig:\n", + " metastoreId: \"\"\n", + "\n", + " entity: &flights_entity\n", + " - kind: entity\n", + " spec:\n", + " name: Flights\n", + "\n", + " featureGroup:\n", + " - kind: featureGroup\n", + " spec:\n", + " entity: *flights_entity\n", + " name: flights_feature_group\n", + " primaryKeys:\n", + " - IATA_CODE\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: AIRPORT\n", + " orderNumber: 2\n", + " - featureType: STRING\n", + " name: CITY\n", + " orderNumber: 3\n", + " - featureType: STRING\n", + " name: STATE\n", + " orderNumber: 4\n", + " - featureType: STRING\n", + " name: COUNTRY\n", + " orderNumber: 5\n", + " - featureType: FLOAT\n", + " name: LATITUDE\n", + " orderNumber: 6\n", + " - featureType: FLOAT\n", + " name: LONGITUDE\n", + " orderNumber: 7\n", + " - kind: featureGroup\n", + " spec:\n", + " entity: *flights_entity\n", + " name: airlines_feature_group\n", + " primaryKeys:\n", + " - IATA_CODE\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: AIRPORT\n", + " orderNumber: 2\n", + " - featureType: STRING\n", + " name: CITY\n", + " 
orderNumber: 3\n", + " - featureType: STRING\n", + " name: STATE\n", + " orderNumber: 4\n", + " - featureType: STRING\n", + " name: COUNTRY\n", + " orderNumber: 5\n", + " - featureType: FLOAT\n", + " name: LATITUDE\n", + " orderNumber: 6\n", + " - featureType: FLOAT\n", + " name: LONGITUDE\n", + " orderNumber: 7\n", + "\n", + " - kind: featureGroup\n", + " spec:\n", + " entity: *flights_entity\n", + " name: airport_feature_group\n", + " primaryKeys:\n", + " - IATA_CODE\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: AIRLINE\n", + " orderNumber: 2\n", + " dataset:\n", + " - kind: dataset\n", + " spec:\n", + " name: flights_dataset\n", + " entity: *flights_entity\n", + " description: \"Dataset for flights\"\n", + " query: 'SELECT flight.IATA_CODE, flight.AIRPORT FROM flights_feature_group flight'\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "23bc53a4", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "fd2434312d73436fac996ff64f4f50f5", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "loop1: 0%| | 0/6 [00:00\n", + "# References\n", + "\n", + "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", + "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", + "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", + "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "914eafdd", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda 
env:fspyspark32_p38_cpu#conda_v1]", + "language": "python", + "name": "conda-env-fspyspark32_p38_cpu_conda_v1-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebook_examples/feature_store_quickstart.ipynb b/notebook_examples/feature_store_quickstart.ipynb new file mode 100644 index 00000000..403a2af2 --- /dev/null +++ b/notebook_examples/feature_store_quickstart.ipynb @@ -0,0 +1,1940 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "4a426ee8", + "metadata": { + "pycharm": { + "name": "#%% raw\n" + } + }, + "source": [ + "@notebook{feature_store-quickstart.ipynb,\n", + " title: Using feature store for feature ingestion and feature querying,\n", + " summary: Feature store quickstart guide to perform feature ingestion and feature querying.,\n", + " developed_on: fs_pyspark32_p38_cpu_v1,\n", + " keywords: feature store,\n", + " license: Universal Permissive License v 1.0\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "9e98a0a2", + "metadata": { + "pycharm": { + "is_executing": true, + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "zsh:1: command not found: odsc\r\n" + ] + } + ], + "source": [ + "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", + "\n", + "!odsc conda install --uri https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/service-conda-packs-fs/o/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda" + ] + }, + { + "cell_type": "markdown", + "id": "67dc5be9", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "Oracle Data Science service sample 
notebook.\n", + "\n", + "Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", + "\n", + "***\n", + "\n", + "# Feature store quickstart\n", + "

by the Oracle Cloud Infrastructure Data Science Service.

\n", + "\n", + "---\n", + "# Overview:\n", + "---\n", + "Managing many datasets, data sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training-serving skew all lead to increased model development time and worse model performance. A feature store is well positioned to solve many of these problems because it provides a centralised way to transform and access data at training and serving time, and it helps define a standardised pipeline for ingesting and querying data.\n", + "\n", + "## Contents:\n", + "\n", + "- 1. Introduction\n", + "- 2. Pre-requisites\n", + " - 2.1 Policies\n", + " - 2.2 Authentication\n", + " - 2.3 Variables\n", + "- 3. Feature store quickstart using APIs\n", + " - 3.1. Create feature store\n", + " - 3.2. Create business entity in feature store\n", + " - 3.3. Create feature group and upload data to feature group\n", + " - 3.4. Query feature group\n", + " - 3.5. Create dataset from multiple or one feature group\n", + " - 3.6 Query dataset\n", + "- 4. Feature store quickstart using YAML\n", + "- 5. References\n", + "\n", + "---\n", + "\n", + "**Important:**\n", + "\n", + "Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = \"\"` would become `database_name = \"production\"`.\n", + "\n", + "---\n", + "\n", + "Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle.\n", + "\n", + "This [`Citi Bike`](https://ride.citibikenyc.com/data-sharing-policy) dataset license is used in this notebook.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "d41663f1", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 1. 
Introduction\n", + "\n", + "Oracle feature store is a stack-based solution that is deployed in the customer enclave using OCI Resource Manager. Customers can stand up the service with infrastructure in their own tenancy. The service consists of APIs that are deployed in the customer tenancy using Resource Manager.\n", + "\n", + "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "\n", + "\n", + "* **Feature Vector**: A set of feature values for any one primary or identifier key. For example, all or a subset of the features of customer id ‘2536’ can be called one feature vector.\n", + "\n", + "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", + "\n", + "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc.\n", + "\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering.\n", + "\n", + "* **Feature Group Job**: A feature group job is the execution instance of a feature group. 
Each feature group job will include validation results and statistics results.\n", + "\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", + "\n", + "* **Dataset Job**: A dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results." + ] + }, + { + "cell_type": "markdown", + "id": "ce2f00ee", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 2. Pre-requisites\n", + "\n", + "Notebook Sessions are accessible through the following conda environment: \n", + "\n", + "* **PySpark 3.2 and Feature store 1.0 (fs_pyspark32_p38_cpu_v1)**\n", + "\n", + "You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. " + ] + }, + { + "cell_type": "markdown", + "id": "f503e105", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### `spark-defaults.conf`\n", + "\n", + "The `spark-defaults.conf` file is used to define the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually, but the `odsc data-catalog config` command-line tool is ideal for setting up the file because it gathers information about your environment and uses that to build the file.\n", + "\n", + "The `odsc data-catalog config` command-line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command-line option is needed because settings have default values, or they take values from your notebook session environment. The following are common parameters that you may need to override.\n", + "\n", + "The `--authentication` option sets the authentication mode. It supports resource principal and API keys. 
The preferred method for authentication is resource principal, which is set with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If the `--authentication` option isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.\n", + "\n", + "Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. Use the `--region` option to override this behavior.\n", + "\n", + "The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory where the file is written.\n", + "\n", + "You need to determine what settings are appropriate for your configuration. However, the following works for most configurations and is run in a terminal window.\n", + "\n", + "```bash\n", + "odsc data-catalog config --authentication resource_principal --metastore \n", + "```\n", + "For more assistance, use the following command in a terminal window:\n", + "\n", + "```bash\n", + "odsc data-catalog config --help\n", + "```\n", + "\n", + "\n", + "### Session Setup\n", + "\n", + "The notebook makes connections to the Data Catalog metastore and Object Storage. In the next cell, specify the bucket URI to act as the data warehouse. Use the `warehouse_uri` variable with the `oci://@/` format. Update the variable `metastore_id` with the OCID of the Data Catalog metastore." + ] + }, + { + "cell_type": "markdown", + "id": "9a781306", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 2.1. 
Policies\n", + "This section covers the creation of dynamic groups and policies needed to use the service.\n", + "\n", + "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)\n", + "* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)" + ] + }, + { + "cell_type": "markdown", + "id": "2c7106e4", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 2.2. Authentication\n", + "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook Spark cluster.
\n", + "To set up authentication, use ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. " + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "89bdc3aa", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import ads\n", + "ads.set_auth(auth=\"api_key\", client_kwargs={\"service_endpoint\": \"http://{api_gateway}:21000/20230101\"})" + ] + }, + { + "cell_type": "markdown", + "id": "d7c223c0", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 2.3. Variables\n", + "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for storing logs. The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository for understanding tables backed by files on Object Storage, and the metastore ID of the Hive metastore is tied to the feature store construct of the feature store service." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "a2ca06cb", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n", + "metastore_id = \"\"" + ] + }, + { + "cell_type": "markdown", + "id": "03dc9e2c", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 3. Feature store quickstart using APIs\n", + "By default, the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) and [deequ](https://github.com/awslabs/deequ) libraries. 
In the ADS feature store module, you can use either the Python programmatic interface or the YAML interface to define feature store entities. The section below describes how to create feature store entities using the programmatic interface." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "3bfeace2", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:64: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n", + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/pandas/__init__.py:46: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " LooseVersion(pyarrow.__version__) >= LooseVersion(\"2.0.0\")\n", + "\n", + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/pandas/__init__.py:49: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. 
pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.\n", + " warnings.warn(\n", + "\n", + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/pandas/groupby.py:49: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) >= LooseVersion(\"1.3.0\"):\n", + "\n", + "ERROR:logger:Please set env variable SPARK_VERSION\n", + "INFO:logger:Using deequ: com.amazon.deequ:deequ:1.2.2-spark-3.0\n" + ] + } + ], + "source": [ + "import pandas as pd \n", + "from ads.feature_store.feature_store import FeatureStore\n", + "from ads.feature_store.dataset import Dataset\n", + "from ads.feature_store.feature_group import FeatureGroup\n", + "from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar\n", + "from ads.feature_store.common.enums import ExpectationType" + ] + }, + { + "cell_type": "markdown", + "id": "2b3fad36", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.1 Create feature store\n", + "Feature store is a top level construct to provide logical segregation of resources" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "4688d55b", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "feature_store_resource = (\n", + " FeatureStore().\n", + " with_description(\"Data consisting of bike riders data\").\n", + " with_compartment_id(compartment_id).\n", + " with_display_name(\"Bike rides\").\n", + " with_offline_config(metastore_id=metastore_id)\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "191d1d31", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "feature_store = feature_store_resource.create()" + ] + }, + { + "cell_type": "markdown", + "id": "0ba52241", + "metadata": { + "pycharm": { + 
"name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.2 Create entity\n", + "An entity is a group of semantically related features. The first step a consumer of features typically takes when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "f3fff48b", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "entity = feature_store.create_entity(\n", + " display_name=\"Bike rides\",\n", + " description=\"description for bike riders\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8f1d165b", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.3 Create feature group\n", + "A feature group is the code that contains instructions on the ingestion of raw data and the computation of feature values. This notebook uses the [`Citi Bike`](https://ride.citibikenyc.com/data-sharing-policy) dataset under its license. " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "6aaac72f", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "bike_df = pd.read_csv(\"/data/flights-data/archives/201901-citibike-tripdata.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "47140320", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "bike_df = bike_df.drop(['start station name', 'end station name'], axis=1)\n", + "bike_df.columns = bike_df.columns.str.replace(' ', '')" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "e87a1587", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | tripduration | starttime | stoptime | startstationid | startstationlatitude | startstationlongitude | endstationid | endstationlatitude | endstationlongitude | bikeid | usertype | birthyear | gender |
| 0 | 320 | 2019-01-01 00:01:47.4010 | 2019-01-01 00:07:07.5810 | 3160.0 | 40.778968 | -73.973747 | 3283.0 | 40.788221 | -73.970416 | 15839 | Subscriber | 1971 | 1 |
| 1 | 316 | 2019-01-01 00:04:43.7360 | 2019-01-01 00:10:00.6080 | 519.0 | 40.751873 | -73.977706 | 518.0 | 40.747804 | -73.973442 | 32723 | Subscriber | 1964 | 1 |
| 2 | 591 | 2019-01-01 00:06:03.9970 | 2019-01-01 00:15:55.4380 | 3171.0 | 40.785247 | -73.976673 | 3154.0 | 40.773142 | -73.958562 | 27451 | Subscriber | 1987 | 1 |
| 3 | 2719 | 2019-01-01 00:07:03.5450 | 2019-01-01 00:52:22.6500 | 504.0 | 40.732219 | -73.981656 | 3709.0 | 40.738046 | -73.996430 | 21579 | Subscriber | 1990 | 1 |
| 4 | 303 | 2019-01-01 00:07:35.9450 | 2019-01-01 00:12:39.5020 | 229.0 | 40.727434 | -73.993790 | 503.0 | 40.738274 | -73.987520 | 35379 | Subscriber | 1979 | 1 |
\n", + "
" + ], + "text/plain": [ + " tripduration starttime stoptime \\\n", + "0 320 2019-01-01 00:01:47.4010 2019-01-01 00:07:07.5810 \n", + "1 316 2019-01-01 00:04:43.7360 2019-01-01 00:10:00.6080 \n", + "2 591 2019-01-01 00:06:03.9970 2019-01-01 00:15:55.4380 \n", + "3 2719 2019-01-01 00:07:03.5450 2019-01-01 00:52:22.6500 \n", + "4 303 2019-01-01 00:07:35.9450 2019-01-01 00:12:39.5020 \n", + "\n", + " startstationid startstationlatitude startstationlongitude endstationid \\\n", + "0 3160.0 40.778968 -73.973747 3283.0 \n", + "1 519.0 40.751873 -73.977706 518.0 \n", + "2 3171.0 40.785247 -73.976673 3154.0 \n", + "3 504.0 40.732219 -73.981656 3709.0 \n", + "4 229.0 40.727434 -73.993790 503.0 \n", + "\n", + " endstationlatitude endstationlongitude bikeid usertype birthyear \\\n", + "0 40.788221 -73.970416 15839 Subscriber 1971 \n", + "1 40.747804 -73.973442 32723 Subscriber 1964 \n", + "2 40.773142 -73.958562 27451 Subscriber 1987 \n", + "3 40.738046 -73.996430 21579 Subscriber 1990 \n", + "4 40.738274 -73.987520 35379 Subscriber 1979 \n", + "\n", + " gender \n", + "0 1 \n", + "1 1 \n", + "2 1 \n", + "3 1 \n", + "4 1 " + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bike_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "e704bb08", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{\"expectation_type\": \"expect_column_values_to_not_be_null\", \"meta\": {}, \"kwargs\": {\"column\": \"stoptime\"}}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", + "\n", + "expectation_suite = ExpectationSuite(expectation_suite_name=\"feature_definition\")\n", + "expectation_suite.add_expectation(\n", + " ExpectationConfiguration(\n", + " 
expectation_type=\"expect_column_values_to_not_be_null\",\n", + " kwargs={\"column\": \"stoptime\"}\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "02d6fc25", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "feature_group_bike = (\n", + " FeatureGroup()\n", + " .with_feature_store_id(feature_store.id)\n", + " .with_primary_keys([\"bikeid\"])\n", + " .with_name(\"bike_feature_group\")\n", + " .with_entity_id(entity.id)\n", + " .with_compartment_id(compartment_id)\n", + " .with_schema_details_from_dataframe(bike_df)\n", + " .with_expectation_suite(expectation_suite, ExpectationType.LENIENT)\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "228401d1", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", + " entityId: 1C29D0DF65E456211B7351D85F271E03\n", + " expectationDetails:\n", + " createRuleDetails:\n", + " - arguments:\n", + " column: stoptime\n", + " levelType: ERROR\n", + " name: Rule-0\n", + " ruleType: expect_column_values_to_not_be_null\n", + " expectationType: LENIENT\n", + " name: feature_definition\n", + " validationEngineType: GREAT_EXPECTATIONS\n", + " featureStoreId: AB5F8E0C4BD86255C3828039D8C51853\n", + " id: 60E6662F04168EEFE781D7ACE576F339\n", + " inputFeatureDetails:\n", + " - featureType: INTEGER\n", + " name: tripduration\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: starttime\n", + " orderNumber: 2\n", + " - featureType: STRING\n", + " name: stoptime\n", + " orderNumber: 3\n", + " - featureType: FLOAT\n", + " name: startstationid\n", + " orderNumber: 4\n", + " - featureType: FLOAT\n", + " name: startstationlatitude\n", + " orderNumber: 5\n", + " - featureType: FLOAT\n", + " name: startstationlongitude\n", + " 
orderNumber: 6\n", + " - featureType: FLOAT\n", + " name: endstationid\n", + " orderNumber: 7\n", + " - featureType: FLOAT\n", + " name: endstationlatitude\n", + " orderNumber: 8\n", + " - featureType: FLOAT\n", + " name: endstationlongitude\n", + " orderNumber: 9\n", + " - featureType: INTEGER\n", + " name: bikeid\n", + " orderNumber: 10\n", + " - featureType: STRING\n", + " name: usertype\n", + " orderNumber: 11\n", + " - featureType: INTEGER\n", + " name: birthyear\n", + " orderNumber: 12\n", + " - featureType: INTEGER\n", + " name: gender\n", + " orderNumber: 13\n", + " name: bike_feature_group\n", + " primaryKeys:\n", + " items:\n", + " - name: bikeid\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: featureGroup" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_bike.create()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "98afef8e", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "os.environ[\"DEVELOPER_MODE\"] = \"True\"" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "732e20e8", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + ":: loading settings :: url = jar:file:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Ivy Default Cache set to: /Users/kshitizlohia/.ivy2/cache\n", + "The jars for the packages stored in: /Users/kshitizlohia/.ivy2/jars\n", + "io.delta#delta-core_2.12 added as a dependency\n", + ":: resolving dependencies :: org.apache.spark#spark-submit-parent-e96bd2ce-ad22-46d2-bd46-aa51029113aa;1.0\n", + "\tconfs: [default]\n", + "\tfound io.delta#delta-core_2.12;2.3.0 in central\n", + 
"\tfound io.delta#delta-storage;2.3.0 in central\n", + "\tfound org.antlr#antlr4-runtime;4.8 in local-m2-cache\n", + ":: resolution report :: resolve 137ms :: artifacts dl 25ms\n", + "\t:: modules in use:\n", + "\tio.delta#delta-core_2.12;2.3.0 from central in [default]\n", + "\tio.delta#delta-storage;2.3.0 from central in [default]\n", + "\torg.antlr#antlr4-runtime;4.8 from local-m2-cache in [default]\n", + "\t---------------------------------------------------------------------\n", + "\t| | modules || artifacts |\n", + "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", + "\t---------------------------------------------------------------------\n", + "\t| default | 3 | 0 | 0 | 0 || 3 | 0 |\n", + "\t---------------------------------------------------------------------\n", + ":: retrieving :: org.apache.spark#spark-submit-parent-e96bd2ce-ad22-46d2-bd46-aa51029113aa\n", + "\tconfs: [default]\n", + "\t0 artifacts copied, 3 already retrieved (0kB/8ms)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "23/05/16 18:29:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting default log level to \"WARN\".\n", + "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:37: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n", + " for column, series in pdf.iteritems():\n", + "\n", + "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n", + " for column, series in pdf.iteritems():\n", + "\n", + "INFO:great_expectations.validator.validator:\t1 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "64ddfd3353dd457c99630a61d89fe748", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/6 [00:00 (0 + 8) / 8]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "23/05/16 18:30:05 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory\n", + "Scaling row group sizes to 96.54% for 7 writers\n", + "23/05/16 18:30:05 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory\n", + "Scaling row group sizes to 84.47% for 8 writers\n", + "23/05/16 18:30:07 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory\n", + "Scaling row group sizes to 96.54% for 7 writers\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\r", + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "23/05/16 18:30:11 WARN package: Truncated the string representation of a plan since it was too large. 
This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "23/05/16 18:30:15 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `1c29d0df65e456211b7351d85f271e03`.`bike_feature_group` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "feature_group_bike.materialise(bike_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "711efb2e", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | endstationlongitude | tripduration | bikeid | startstationlongitude | endstationid | usertype | starttime | startstationid | endstationlatitude | startstationlatitude | birthyear | stoptime | gender |
| completeness | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| approximateNumDistinctValues | 83 | 92 | 94 | 83 | 93 | 2 | 104 | 85 | 89 | 86 | 36 | 101 | 3 |
| dataType | Fractional | Integral | Integral | Fractional | Fractional | String | String | Fractional | Fractional | Fractional | Integral | String | Integral |
| sum | -7398.150004 | 76840.0 | 2914421.0 | -7398.157728 | 155797.0 | NaN | NaN | 186276.0 | 4074.01599 | 4074.092498 | 198127.0 | NaN | 118.0 |
| min | -74.016584 | 97.0 | 14656.0 | -74.012723 | 127.0 | NaN | NaN | 79.0 | 40.668603 | 40.668127 | 1949.0 | NaN | 0.0 |
| max | -73.941995 | 3494.0 | 35789.0 | -73.942237 | 3709.0 | NaN | NaN | 3675.0 | 40.810792 | 40.804213 | 1999.0 | NaN | 2.0 |
| mean | -73.9815 | 768.4 | 29144.21 | -73.981577 | 1557.97 | NaN | NaN | 1862.76 | 40.74016 | 40.740925 | 1981.27 | NaN | 1.18 |
| stddev | 0.018151 | 686.187846 | 6319.234326 | 0.017465 | 1428.093551 | NaN | NaN | 1438.05532 | 0.031828 | 0.03259 | 11.713117 | NaN | 0.497594 |
\n", + "
" + ], + "text/plain": [ + " endstationlongitude tripduration bikeid \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 83 92 94 \n", + "dataType Fractional Integral Integral \n", + "sum -7398.150004 76840.0 2914421.0 \n", + "min -74.016584 97.0 14656.0 \n", + "max -73.941995 3494.0 35789.0 \n", + "mean -73.9815 768.4 29144.21 \n", + "stddev 0.018151 686.187846 6319.234326 \n", + "\n", + " startstationlongitude endstationid usertype \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 83 93 2 \n", + "dataType Fractional Fractional String \n", + "sum -7398.157728 155797.0 NaN \n", + "min -74.012723 127.0 NaN \n", + "max -73.942237 3709.0 NaN \n", + "mean -73.981577 1557.97 NaN \n", + "stddev 0.017465 1428.093551 NaN \n", + "\n", + " starttime startstationid endstationlatitude \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 104 85 89 \n", + "dataType String Fractional Fractional \n", + "sum NaN 186276.0 4074.01599 \n", + "min NaN 79.0 40.668603 \n", + "max NaN 3675.0 40.810792 \n", + "mean NaN 1862.76 40.74016 \n", + "stddev NaN 1438.05532 0.031828 \n", + "\n", + " startstationlatitude birthyear stoptime \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 86 36 101 \n", + "dataType Fractional Integral String \n", + "sum 4074.092498 198127.0 NaN \n", + "min 40.668127 1949.0 NaN \n", + "max 40.804213 1999.0 NaN \n", + "mean 40.740925 1981.27 NaN \n", + "stddev 0.03259 11.713117 NaN \n", + "\n", + " gender \n", + "completeness 1.0 \n", + "approximateNumDistinctValues 3 \n", + "dataType Integral \n", + "sum 118.0 \n", + "min 0.0 \n", + "max 2.0 \n", + "mean 1.18 \n", + "stddev 0.497594 " + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_bike.get_statistics().to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "5bfcded2", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + 
"outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | success | results | statistics.evaluated_expectations | statistics.successful_expectations | statistics.unsuccessful_expectations | statistics.success_percent | meta.great_expectations_version | meta.expectation_suite_name | meta.run_id.run_time | meta.run_id.run_name | meta.batch_markers.ge_load_time | meta.active_batch_definition.datasource_name | meta.active_batch_definition.data_connector_name | meta.active_batch_definition.data_asset_name | meta.active_batch_definition.batch_identifiers.ge_batch_id | meta.validation_time | meta.checkpoint_name |
| 0 | True | [{'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'meta': {}, 'kwargs': {'column': 'stoptime', 'batch_id': 'feca776acdd0aa61ae53da7b674430a1'}}, 'exception_info': {'raised_exception': False, 'exception_traceback': None, 'exception_message': None}, 'result': {'element_count': 100, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}, 'success': True, 'meta': {}}] | 1 | 1 | 0 | 100.0 | 0.16.10 | bike_feature_group | 2023-05-16T18:29:58.670292+05:30 | None | 20230516T125958.669418Z | feature-ingestion-pipeline | feature-ingestion-pipeline | feature-ingestion-pipeline | 8ff83c32-f3e9-11ed-aedd-b29c4acce130 | 20230516T125958.670193Z | None |
\n", + "
" + ], + "text/plain": [ + " success \\\n", + "0 True \n", + "\n", + " results \\\n", + "0 [{'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'meta': {}, 'kwargs': {'column': 'stoptime', 'batch_id': 'feca776acdd0aa61ae53da7b674430a1'}}, 'exception_info': {'raised_exception': False, 'exception_traceback': None, 'exception_message': None}, 'result': {'element_count': 100, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}, 'success': True, 'meta': {}}] \n", + "\n", + " statistics.evaluated_expectations statistics.successful_expectations \\\n", + "0 1 1 \n", + "\n", + " statistics.unsuccessful_expectations statistics.success_percent \\\n", + "0 0 100.0 \n", + "\n", + " meta.great_expectations_version meta.expectation_suite_name \\\n", + "0 0.16.10 bike_feature_group \n", + "\n", + " meta.run_id.run_time meta.run_id.run_name \\\n", + "0 2023-05-16T18:29:58.670292+05:30 None \n", + "\n", + " meta.batch_markers.ge_load_time \\\n", + "0 20230516T125958.669418Z \n", + "\n", + " meta.active_batch_definition.datasource_name \\\n", + "0 feature-ingestion-pipeline \n", + "\n", + " meta.active_batch_definition.data_connector_name \\\n", + "0 feature-ingestion-pipeline \n", + "\n", + " meta.active_batch_definition.data_asset_name \\\n", + "0 feature-ingestion-pipeline \n", + "\n", + " meta.active_batch_definition.batch_identifiers.ge_batch_id \\\n", + "0 8ff83c32-f3e9-11ed-aedd-b29c4acce130 \n", + "\n", + " meta.validation_time meta.checkpoint_name \n", + "0 20230516T125958.670193Z None " + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_bike.get_validation_output_df()" + ] + }, + { + "cell_type": "markdown", + "id": "b7ba161c", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.4 Query feature group\n", + "Feature store provides a DataFrame API to ingest data into the Feature Store. 
You can also retrieve feature data in a DataFrame, that can either be used directly to train models or materialized to file(s) for later use to train models" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "c175849c", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+------------+--------------------+--------------------+--------------+--------------------+---------------------+------------+------------------+-------------------+------+----------+---------+------+\n", + "|tripduration| starttime| stoptime|startstationid|startstationlatitude|startstationlongitude|endstationid|endstationlatitude|endstationlongitude|bikeid| usertype|birthyear|gender|\n", + "+------------+--------------------+--------------------+--------------+--------------------+---------------------+------------+------------------+-------------------+------+----------+---------+------+\n", + "| 976|2019-01-01 00:15:...|2019-01-01 00:31:...| 3452.0| 40.71915571696044| -73.94885390996933| 251.0| 40.72317958| -73.99480012| 35685|Subscriber| 1994| 1|\n", + "| 97|2019-01-01 00:15:...|2019-01-01 00:17:...| 3430.0| 40.71907891179564| -73.94223690032959| 3095.0| 40.71929301| -73.94500379| 34307|Subscriber| 1988| 1|\n", + "| 467|2019-01-01 00:16:...|2019-01-01 00:24:...| 507.0| 40.73912601| -73.97973776| 492.0| 40.75019995| -73.99093085| 35561|Subscriber| 1989| 1|\n", + "| 348|2019-01-01 00:17:...|2019-01-01 00:23:...| 3095.0| 40.71929301| -73.94500379| 3101.0| 40.72079821| -73.95484712| 35695|Subscriber| 1988| 1|\n", + "| 505|2019-01-01 00:18:...|2019-01-01 00:27:...| 3132.0| 40.76350532| -73.97109243| 359.0| 40.75510267| -73.97498696| 31801|Subscriber| 1981| 1|\n", + "| 3494|2019-01-01 00:18:...|2019-01-01 01:17:...| 3171.0| 40.78524672| -73.97667321| 3164.0| 40.7770575| -73.97898475| 35785|Subscriber| 1954| 1|\n", + "| 829|2019-01-01 00:19:...|2019-01-01 00:32:...| 3165.0| 40.77579376683666| 
-73.9762057363987| 3295.0| 40.79127| -73.964839| 32106|Subscriber| 1969| 0|\n", + "| 451|2019-01-01 00:21:...|2019-01-01 00:28:...| 403.0| 40.72502876| -73.99069656| 545.0| 40.736502| -73.97809472| 32038|Subscriber| 1985| 1|\n", + "| 736|2019-01-01 00:21:...|2019-01-01 00:33:...| 3165.0| 40.77579376683666| -73.9762057363987| 3295.0| 40.79127| -73.964839| 16761| Customer| 1989| 2|\n", + "| 617|2019-01-01 00:21:...|2019-01-01 00:31:...| 3159.0| 40.77492513| -73.98266566| 3142.0| 40.7612274| -73.96094022| 24895|Subscriber| 1998| 1|\n", + "+------------+--------------------+--------------------+--------------+--------------------+---------------------+------------+------------------+-------------------+------+----------+---------+------+\n", + "only showing top 10 rows\n", + "\n" + ] + } + ], + "source": [ + "query = feature_group_bike.select() \n", + "query.show()" + ] + }, + { + "cell_type": "markdown", + "id": "962e563d", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.5 Create dataset\n", + "A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "147ae5bd", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'SELECT fg_0.tripduration tripduration, fg_0.starttime starttime, fg_0.stoptime stoptime, fg_0.startstationid startstationid, fg_0.startstationlatitude startstationlatitude, fg_0.startstationlongitude startstationlongitude, fg_0.endstationid endstationid, fg_0.endstationlatitude endstationlatitude, fg_0.endstationlongitude endstationlongitude, fg_0.bikeid bikeid, fg_0.usertype usertype, fg_0.birthyear birthyear, fg_0.gender gender FROM `1C29D0DF65E456211B7351D85F271E03`.bike_feature_group fg_0'" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "query.to_string()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "440b129e", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "dataset_resource = (\n", + " Dataset()\n", + " .with_description(\"Dataset consisting of a subset of features in feature group: bike riders\")\n", + " .with_compartment_id(compartment_id)\n", + " .with_name(\"bike_riders_dataset\")\n", + " .with_entity_id(entity.id)\n", + " .with_feature_store_id(feature_store.id)\n", + " .with_query(query.to_string())\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "10dd5758", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "dataset = dataset_resource.create()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "d4b077da", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "23/05/16 18:31:37 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. 
Persisting data source table `1c29d0df65e456211b7351d85f271e03`.`bike_riders_dataset` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "dataset.materialise()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "db5d6854", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
" + ], + "text/plain": [ + " endstationlongitude tripduration bikeid \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 83 92 94 \n", + "dataType Fractional Integral Integral \n", + "sum -7398.150004 76840.0 2914421.0 \n", + "min -74.016584 97.0 14656.0 \n", + "max -73.941995 3494.0 35789.0 \n", + "mean -73.9815 768.4 29144.21 \n", + "stddev 0.018151 686.187846 6319.234326 \n", + "\n", + " startstationlongitude endstationid usertype \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 83 93 2 \n", + "dataType Fractional Fractional String \n", + "sum -7398.157728 155797.0 NaN \n", + "min -74.012723 127.0 NaN \n", + "max -73.942237 3709.0 NaN \n", + "mean -73.981577 1557.97 NaN \n", + "stddev 0.017465 1428.093551 NaN \n", + "\n", + " starttime startstationid endstationlatitude \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 104 85 89 \n", + "dataType String Fractional Fractional \n", + "sum NaN 186276.0 4074.01599 \n", + "min NaN 79.0 40.668603 \n", + "max NaN 3675.0 40.810792 \n", + "mean NaN 1862.76 40.74016 \n", + "stddev NaN 1438.05532 0.031828 \n", + "\n", + " startstationlatitude birthyear stoptime \\\n", + "completeness 1.0 1.0 1.0 \n", + "approximateNumDistinctValues 86 36 101 \n", + "dataType Fractional Integral String \n", + "sum 4074.092498 198127.0 NaN \n", + "min 40.668127 1949.0 NaN \n", + "max 40.804213 1999.0 NaN \n", + "mean 40.740925 1981.27 NaN \n", + "stddev 0.03259 11.713117 NaN \n", + "\n", + " gender \n", + "completeness 1.0 \n", + "approximateNumDistinctValues 3 \n", + "dataType Integral \n", + "sum 118.0 \n", + "min 0.0 \n", + "max 2.0 \n", + "mean 1.18 \n", + "stddev 0.497594 " + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dataset.get_statistics().to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "38da2a60", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 4. 
Feature store quick start using YAML\n", + "In the ADS feature store module, you can use either the Python programmatic interface or YAML to define feature store entities. The following section describes how to create feature store entities using YAML as an interface." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d3aa939e", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "feature_store_yaml = \"\"\"\n", + "apiVersion: v1\n", + "kind: featureStore\n", + "spec:\n", + " displayName: Bike feature store\n", + " compartmentId: \"ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\"\n", + " offlineConfig:\n", + " metastoreId: \"ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\"\n", + "\n", + " entity: &bike_entity\n", + " - kind: entity\n", + " spec:\n", + " name: Bike rides\n", + "\n", + " featureGroup:\n", + " - kind: featureGroup\n", + " spec:\n", + " entity: *bike_entity\n", + " name: bike_feature_group\n", + " primaryKeys:\n", + " - bikeid\n", + " inputFeatureDetails:\n", + " - name: \"bikeid\"\n", + " featureType: \"INTEGER\"\n", + " orderNumber: 1\n", + " cast: \"STRING\"\n", + " - name: \"endstationlongitude\"\n", + " featureType: \"FLOAT\"\n", + " orderNumber: 2\n", + " cast: \"STRING\"\n", + " - name: \"tripduration\"\n", + " featureType: \"INTEGER\"\n", + " orderNumber: 3\n", + " cast: \"STRING\"\n", + "\n", + " dataset:\n", + " - kind: dataset\n", + " spec:\n", + " name: bike_dataset\n", + " entity: *bike_entity\n", + " description: \"Dataset for bike\"\n", + " query: 'SELECT bike.bikeid, bike.endstationlongitude FROM bike_feature_group bike'\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "238a8507", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "75021638e00044e09f9dfa4e15aa6ce9", + 
"version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "loop1: 0%| | 0/4 [00:00\n", + "# References\n", + "\n", + "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", + "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", + "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", + "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebook_examples/feature_store_schema_evolution.ipynb b/notebook_examples/feature_store_schema_evolution.ipynb new file mode 100644 index 00000000..940bca73 --- /dev/null +++ b/notebook_examples/feature_store_schema_evolution.ipynb @@ -0,0 +1,3546 @@ +{ + "cells": [ + { + "cell_type": "raw", + "id": "01cacd8a", + "metadata": {}, + "source": [ + "qweews@notebook{feature_store-querying.ipynb,\n", + " title: Using feature store for feature querying using pandas like interface for query and join,\n", + " summary: Feature store quickstart guide to perform feature querying using pandas like interface for query and join.,\n", + " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " keywords: feature store, querying,\n", + " license: Universal Permissive License v 1.0\n", + "}" + ] + }, + { + "cell_type": "raw", + "id": "dba1f334", + "metadata": { + "ExecuteTime": { + "end_time": "2023-05-24T08:26:08.572567Z", + "start_time": "2023-05-24T08:26:08.328013Z" + } + }, + "source": [ + "!odsc conda install 
--uri https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/service-conda-packs-fs/o/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "572d752e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/bin/bash: -c: line 0: syntax error near unexpected token `newline'\n", + "/bin/bash: -c: line 0: `odsc data-catalog config --authentication resource_principal --metastore '\n" + ] + } + ], + "source": [ + "!odsc data-catalog config --authentication resource_principal --metastore " + ] + }, + { + "cell_type": "markdown", + "id": "ebe05d00", + "metadata": {}, + "source": [ + "Oracle Data Science service sample notebook.\n", + "\n", + "Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", + "\n", + "***\n", + "\n", + "# Schema enforcement and schema evolution\n", + "

by the Oracle Cloud Infrastructure Data Science Service.

 \n", + "\n", + "---\n", + "# Overview:\n", + "---\n", + "Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training-serving skew all lead to longer model development time and worse model performance. A feature store is well positioned to solve many of these problems because it provides a centralised way to transform and access data at training and serving time, and helps define a standardised pipeline for ingesting and querying data. This notebook demonstrates how to use feature store within a long-lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster.\n", + "\n", + "Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n", + "\n", + "
\n", + " \n", + "
 \n", + "\n", + "## Contents:\n", + "\n", + "- 1. Introduction\n", + "- 2. Pre-requisites\n", + " - 2.1 Policies\n", + " - 2.2 Authentication\n", + " - 2.3 Variables\n", + "- 3. Schema enforcement and schema evolution\n", + " - 3.1. Exploration of data in feature store\n", + " - 3.2. Create feature store logical entities\n", + " - 3.3. Schema enforcement\n", + " - 3.4. Ingestion Modes\n", + " - 3.4.1 Append\n", + " - 3.4.2 Overwrite\n", + " - 3.4.3 Upsert\n", + "- 4. References\n", + "\n", + "---\n", + "\n", + "**Important:**\n", + "\n", + "Placeholder text for required values is surrounded by angle brackets that must be removed when adding the indicated content. For example, when adding a database name, `database_name = \"\"` would become `database_name = \"production\"`.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "6854cd38", + "metadata": {}, + "source": [ + "\n", + "# 1. Introduction\n", + "\n", + "Oracle feature store is a stack-based solution that is deployed in the customer enclave using OCI Resource Manager. Customers can stand up the service with infrastructure in their own tenancy. The service consists of APIs that are deployed in the customer tenancy using Resource Manager.\n", + "\n", + "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "\n", + "\n", + "* **Feature Vector**: A set of feature values for any one primary/identifier key. For example, all or a subset of the features of customer ID ‘2536’ can be called one feature vector.\n", + "\n", + "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", + "\n", + "* **Entity**: An entity is a group of semantically related features. The first thing a consumer of features typically does when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. 
Examples of entities include customer, product, transaction, review, image, and document.\n", + "\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering.\n", + "\n", + "* **Feature Group Job**: A feature group job is the execution instance of a feature group. Each feature group job includes validation results and statistics results.\n", + "\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", + "\n", + "* **Dataset Job**: A dataset job is the execution instance of a dataset. Each dataset job includes validation results and statistics results." + ] + }, + { + "cell_type": "markdown", + "id": "ae2fdf26", + "metadata": {}, + "source": [ + "\n", + "# 2. Pre-requisites\n", + "\n", + "Data Flow Sessions are accessible through the following conda environment:\n", + "\n", + "* **PySpark 3.2, Feature store 1.0 and Data Flow 1.0 (fs_pyspark32_p38_cpu_v1)**\n", + "\n", + "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. 
The metastore ID of the Hive metastore is tied to the feature store construct of the feature store service.\n" + ] + }, + { + "cell_type": "markdown", + "id": "486e5d3f", + "metadata": {}, + "source": [ + "\n", + "### `spark-defaults.conf`\n", + "\n", + "The `spark-defaults.conf` file is used to define the properties that are used by Spark. A templated version is installed when you install a Data Science conda environment that supports PySpark. However, you must update the template so that the Data Catalog metastore can be accessed. You can do this manually, but the `odsc data-catalog config` command-line tool is ideal for setting up the file because it gathers information about your environment and uses that to build the file.\n", + "\n", + "The `odsc data-catalog config` command-line tool needs the `--metastore` option to define the Data Catalog metastore OCID. No other command-line option is needed because settings have default values, or they take values from your notebook session environment. The following are common parameters that you may need to override.\n", + "\n", + "The `--authentication` option sets the authentication mode. It supports resource principal and API keys. The preferred method for authentication is resource principal, which is set with `--authentication resource_principal`. If you want to use API keys, then use the `--authentication api_key` option. If the `--authentication` option isn't specified, API keys are used. When API keys are used, information from the OCI configuration file is used to create the `spark-defaults.conf` file.\n", + "\n", + "Object Storage and Data Catalog are regional services. By default, the region is set to the region your notebook session is running in. This information is taken from the environment variable, `NB_REGION`. 
Use the `--region` option to override this behavior.\n", + "\n", + "The default location of the `spark-defaults.conf` file is `/home/datascience/spark_conf_dir` as defined in the `SPARK_CONF_DIR` environment variable. Use the `--output` option to define the directory where to write the file.\n", + "\n", + "You need to determine what settings are appropriate for your configuration. However, the following works for most configurations and is run in a terminal window.\n", + "\n", + "```bash\n", + "odsc data-catalog config --authentication resource_principal --metastore \n", + "```\n", + "For more assistance, use the following command in a terminal window:\n", + "\n", + "```bash\n", + "odsc data-catalog config --help\n", + "```\n", + "\n", + "\n", + "### Session Setup\n", + "\n", + "The notebook makes connections to the Data Catalog metastore and Object Storage. In the next cell, specify the bucket URI to act as the data warehouse. Use the `warehouse_uri` variable with the `oci://@/` format. Update the variable `metastore_id` with the OCID of the Data Catalog metastore." + ] + }, + { + "cell_type": "markdown", + "id": "367ba357", + "metadata": {}, + "source": [ + "\n", + "### 2.1. Policies\n", + "This section covers the creation of dynamic groups and policies needed to use the service.\n", + "\n", + "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)\n", + "* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)\n", + "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n", + "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)" + ] + }, + { + "cell_type": "markdown", + "id": "52bea9cf", + "metadata": {}, + "source": [ + "\n", + "### 2.2. 
Authentication\n", + "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook cluster.
 \n", + "To set up authentication, use ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "ac079f4b", + "metadata": { + "ExecuteTime": { + "start_time": "2023-05-24T08:26:08.577504Z" + }, + "is_executing": true, + "pycharm": { + "is_executing": true + } + }, + "outputs": [], + "source": [ + "import ads\n", + "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"service_endpoint\": \"https://bi3jfhvilwl7gelzjbv3ovim2m.apigateway.us-ashburn-1.oci.customer-oci.com/20230101\"})" + ] + }, + { + "cell_type": "markdown", + "id": "4df685c7", + "metadata": {}, + "source": [ + "\n", + "### 2.3. Variables\n", + "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for the offline feature store." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "963224c0", + "metadata": { + "pycharm": { + "is_executing": true + } + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "compartment_id = \"ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\"\n", + "metastore_id = \"ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\"" + ] + }, + { + "cell_type": "markdown", + "id": "e3087df9", + "metadata": {}, + "source": [ + "\n", + "# 3. Schema enforcement and schema evolution\n", + "By default, the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) and [deequ](https://github.com/awslabs/deequ) libraries. The joining functionality is heavily inspired by the APIs used by pandas to merge, join, or filter DataFrames. 
The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "9f18611c", + "metadata": { + "pycharm": { + "is_executing": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/ads/model/deployment/model_deployment.py:48: DeprecationWarning: The `ads.model.deployment.model_deployment_properties` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeploymentInfrastructure` and `ModelDeploymentRuntime` classes in `ads.model.deployment` module for configuring model deployment. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", + " from .model_deployment_properties import ModelDeploymentProperties\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/ads/model/deployment/__init__.py:7: DeprecationWarning: The `ads.model.deployment.model_deployer` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeployment` class in `ads.model.deployment` module for initializing and deploying model deployment. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", + " from .model_deployer import ModelDeployer\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/__init__.py:44: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " LooseVersion(pyarrow.__version__) >= LooseVersion(\"2.0.0\")\n", + "\n", + "WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/frame.py:62: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) >= LooseVersion(\"0.24\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/frame.py:81: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:85: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:191: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/series.py:89: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/groupby.py:50: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pd.__version__) >= LooseVersion(\"1.3.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/fs/__init__.py:4: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", + "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", + " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/fs/opener/__init__.py:6: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs.opener')`.\n", + "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. 
See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", + " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pkg_resources/__init__.py:2349: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", + "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", + " declare_namespace(parent)\n", + "\n" + ] + } + ], + "source": [ + "import pandas as pd\n", + "from ads.feature_store.feature_store import FeatureStore\n", + "from ads.feature_store.feature_group import FeatureGroup\n", + "from ads.feature_store.model_details import ModelDetails\n", + "from ads.feature_store.dataset import Dataset\n", + "from ads.feature_store.common.enums import DatasetIngestionMode\n", + "\n", + "from ads.feature_store.feature_group_expectation import ExpectationType\n", + "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", + "from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar" + ] + }, + { + "cell_type": "markdown", + "id": "c72aef9f", + "metadata": {}, + "source": [ + "\n", + "### 3.1. Exploration of data in feature store\n", + "\n", + "
\n", + " \n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "d5a71a5f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/tmp/ipykernel_2623/906484602.py:1: DtypeWarning: Columns (7,8) have mixed types. Specify dtype option on import or set low_memory=False.\n", + " flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
YEARMONTHDAYDAY_OF_WEEKAIRLINEFLIGHT_NUMBERORIGIN_AIRPORTDESTINATION_AIRPORT
02015114AS98ANCSEA
12015114AA2336LAXPBI
22015114US840SFOCLT
32015114AA258LAXMIA
42015114AS135SEAANC
\n", + "
" + ], + "text/plain": [ + " YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT \\\n", + "0 2015 1 1 4 AS 98 ANC \n", + "1 2015 1 1 4 AA 2336 LAX \n", + "2 2015 1 1 4 US 840 SFO \n", + "3 2015 1 1 4 AA 258 LAX \n", + "4 2015 1 1 4 AS 135 SEA \n", + "\n", + " DESTINATION_AIRPORT \n", + "0 SEA \n", + "1 PBI \n", + "2 CLT \n", + "3 MIA \n", + "4 ANC " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", + "flights_df = flights_df.head(100)\n", + "flights_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "5f26aa4e", + "metadata": { + "pycharm": { + "is_executing": true + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IATA_CODEAIRPORTCITYSTATELATITUDELONGITUDE
0ABELehigh Valley International AirportAllentownPA40.65236-75.44040
1ABIAbilene Regional AirportAbileneTX32.41132-99.68190
2ABQAlbuquerque International SunportAlbuquerqueNM35.04022-106.60919
3ABRAberdeen Regional AirportAberdeenSD45.44906-98.42183
4ABYSouthwest Georgia Regional AirportAlbanyGA31.53552-84.19447
\n", + "
" + ], + "text/plain": [ + " IATA_CODE AIRPORT CITY STATE LATITUDE \\\n", + "0 ABE Lehigh Valley International Airport Allentown PA 40.65236 \n", + "1 ABI Abilene Regional Airport Abilene TX 32.41132 \n", + "2 ABQ Albuquerque International Sunport Albuquerque NM 35.04022 \n", + "3 ABR Aberdeen Regional Airport Aberdeen SD 45.44906 \n", + "4 ABY Southwest Georgia Regional Airport Albany GA 31.53552 \n", + "\n", + " LONGITUDE \n", + "0 -75.44040 \n", + "1 -99.68190 \n", + "2 -106.60919 \n", + "3 -98.42183 \n", + "4 -84.19447 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE']\n", + "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")[columns]\n", + "airports_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "fdab3e5c", + "metadata": { + "pycharm": { + "is_executing": true + } + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IATA_CODEAIRLINE
0UAUnited Air Lines Inc.
1AAAmerican Airlines Inc.
2USUS Airways Inc.
3F9Frontier Airlines Inc.
4B6JetBlue Airways
\n", + "
" + ], + "text/plain": [ + " IATA_CODE AIRLINE\n", + "0 UA United Air Lines Inc.\n", + "1 AA American Airlines Inc.\n", + "2 US US Airways Inc.\n", + "3 F9 Frontier Airlines Inc.\n", + "4 B6 JetBlue Airways" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "airlines_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airlines.csv\")\n", + "airlines_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "9fd09cb0", + "metadata": {}, + "source": [ + "\n", + "### 3.2. Create feature store logical entities" + ] + }, + { + "cell_type": "markdown", + "id": "2fce1574", + "metadata": {}, + "source": [ + "#### 3.2.1 Feature Store\n", + "A feature store is the top-level entity of the feature store service." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "4104e209", + "metadata": { + "pycharm": { + "is_executing": true + } + }, + "outputs": [], + "source": [ + "feature_store_resource = (\n", + " FeatureStore().\n", + " with_description(\"Data consisting of flights\").\n", + " with_compartment_id(compartment_id).\n", + " with_display_name(\"flights details\").\n", + " with_offline_config(metastore_id=metastore_id)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "adeb7bb8", + "metadata": {}, + "source": [ + "\n", + "##### Create Feature Store\n", + "\n", + "Call the ```.create()``` method of the feature store instance to create the feature store."
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "a6f2e337", + "metadata": { + "pycharm": { + "is_executing": true + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: featurestore\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", + " description: Data consisting of flights\n", + " displayName: flights details\n", + " id: EA128EDAE4380286A842064AF466A685\n", + " offlineConfig:\n", + " metastoreId: ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\n", + "type: featureStore" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_store = feature_store_resource.create()\n", + "feature_store" + ] + }, + { + "cell_type": "markdown", + "id": "d28d15e1", + "metadata": {}, + "source": [ + "#### 3.2.2 Entity\n", + "An entity is a group of semantically related features." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "ee0f393e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: entity\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", + " description: description for flight details\n", + " featureStoreId: EA128EDAE4380286A842064AF466A685\n", + " id: 55EB4FC9F3D8AEE40442046F7B7EE92C\n", + " name: Flight details schema evolution/enforcement\n", + "type: entity" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "entity = feature_store.create_entity(\n", + " display_name=\"Flight details schema evolution/enforcement\",\n", + " description=\"description for flight details\"\n", + ")\n", + "entity" + ] + }, + { + "cell_type": "markdown", + "id": "6998d51a", + "metadata": {}, + "source": [ + "\n", + "#### 3.2.3 Feature Group\n", + "\n", + "Create a feature group for the airports data." + ] + }, + 
{ + "cell_type": "code", + "execution_count": 11, + "id": "5ca56127", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{\"meta\": {}, \"expectation_type\": \"expect_column_values_to_be_between\", \"kwargs\": {\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0}}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", + "\n", + "expectation_suite_airports = ExpectationSuite(\n", + " expectation_suite_name=\"test_airports_df\"\n", + ")\n", + "expectation_suite_airports.add_expectation(\n", + " ExpectationConfiguration(\n", + " expectation_type=\"expect_column_values_to_not_be_null\",\n", + " kwargs={\"column\": \"IATA_CODE\"},\n", + " )\n", + ")\n", + "expectation_suite_airports.add_expectation(\n", + " ExpectationConfiguration(\n", + " expectation_type=\"expect_column_values_to_be_between\",\n", + " kwargs={\"column\": \"LATITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n", + " )\n", + ")\n", + "\n", + "expectation_suite_airports.add_expectation(\n", + " ExpectationConfiguration(\n", + " expectation_type=\"expect_column_values_to_be_between\",\n", + " kwargs={\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0},\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "bb60c0ad", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting default log level to \"WARN\".\n", + "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", + "2023/07/25 10:07:54 NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:471: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n", + " arrow_data = [[(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]\n", + "\n" + ] + } + ], + "source": [ + "feature_group_airports = (\n", + " FeatureGroup()\n", + " .with_feature_store_id(feature_store.id)\n", + " .with_primary_keys([\"IATA_CODE\"])\n", + " .with_name(\"airport_feature_group\")\n", + " .with_entity_id(entity.id)\n", + " .with_compartment_id(compartment_id)\n", + " .with_schema_details_from_dataframe(airports_df)\n", + " .with_expectation_suite(\n", + " expectation_suite=expectation_suite_airports,\n", + " expectation_type=ExpectationType.LENIENT,\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "37159872", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", + " entityId: 
55EB4FC9F3D8AEE40442046F7B7EE92C\n", + " expectationDetails:\n", + " createRuleDetails:\n", + " - arguments:\n", + " column: IATA_CODE\n", + " levelType: ERROR\n", + " name: Rule-0\n", + " ruleType: expect_column_values_to_not_be_null\n", + " - arguments:\n", + " column: LATITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-1\n", + " ruleType: expect_column_values_to_be_between\n", + " - arguments:\n", + " column: LONGITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-2\n", + " ruleType: expect_column_values_to_be_between\n", + " expectationType: LENIENT\n", + " name: test_airports_df\n", + " validationEngineType: GREAT_EXPECTATIONS\n", + " featureStoreId: EA128EDAE4380286A842064AF466A685\n", + " id: 26DE61A551F8BF29F132FF03B62B3E67\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: AIRPORT\n", + " orderNumber: 2\n", + " - featureType: STRING\n", + " name: CITY\n", + " orderNumber: 3\n", + " - featureType: STRING\n", + " name: STATE\n", + " orderNumber: 4\n", + " - featureType: DOUBLE\n", + " name: LATITUDE\n", + " orderNumber: 5\n", + " - featureType: DOUBLE\n", + " name: LONGITUDE\n", + " orderNumber: 6\n", + " isInferSchema: true\n", + " name: airport_feature_group\n", + " primaryKeys:\n", + " items:\n", + " - name: IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: featureGroup" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airports.create()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "7ac26507", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "%3\n", + "\n", + "\n", + "EA128EDAE4380286A842064AF466A685\n", + "\n", + "flights details\n", + "Feature Store\n", + 
"EA128EDAE4380286A842064AF466A685\n", + "\n", + "\n", + "55EB4FC9F3D8AEE40442046F7B7EE92C\n", + "\n", + "Flight details schema evolution/enforcement\n", + "Entity\n", + "55EB4FC9F3D8AEE40442046F7B7EE92C\n", + "\n", + "\n", + "EA128EDAE4380286A842064AF466A685->55EB4FC9F3D8AEE40442046F7B7EE92C\n", + "\n", + "\n", + "\n", + "\n", + "26DE61A551F8BF29F132FF03B62B3E67\n", + "\n", + "airport_feature_group\n", + "Feature Group\n", + "26DE61A551F8BF29F132FF03B62B3E67\n", + "\n", + "\n", + "55EB4FC9F3D8AEE40442046F7B7EE92C->26DE61A551F8BF29F132FF03B62B3E67\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "feature_group_airports.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "1a1ddd95", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Hive Session ID = cdd7eb82-a9e8-4f1b-bdad-93400dab3a3a\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "676d7a993ba94ba2a8fe00292890547b", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/16 [00:00 (0 + 2) / 2]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f629f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f6e970>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f6ef70>), '4cd1d3704778a196571a6c83581854cc': 
DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f6ef30>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f521b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f52230>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f52670>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f52270>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f527f0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f52df0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d43f526b0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f60bb0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f607f0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 
0x7f9d43f600f0>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount 
metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), 
FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, 
config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 
15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 
37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + 
"INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, 
config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 
0.01567398119122254, 0.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", + 
"INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "feature_group_airports.materialise(airports_df)" + ] + }, + { + "cell_type": "markdown", + "id": "a8b2d54e", + "metadata": {}, + "source": [ + "\n", + "### 3.3. Schema enforcement\n", + "\n", + "Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a \"reservation\"), and rejects any writes with columns that aren't on the list." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "f6d46a5e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
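<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>IATA_CODE</th><th>AIRPORT</th><th>CITY</th><th>STATE</th><th>LATITUDE</th><th>LONGITUDE</th><th>COUNTRY</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>ABE</td><td>Lehigh Valley International Airport</td><td>Allentown</td><td>PA</td><td>40.65236</td><td>-75.44040</td><td>USA</td></tr>\n",
+ "    <tr><th>1</th><td>ABI</td><td>Abilene Regional Airport</td><td>Abilene</td><td>TX</td><td>32.41132</td><td>-99.68190</td><td>USA</td></tr>\n",
+ "    <tr><th>2</th><td>ABQ</td><td>Albuquerque International Sunport</td><td>Albuquerque</td><td>NM</td><td>35.04022</td><td>-106.60919</td><td>USA</td></tr>\n",
+ "    <tr><th>3</th><td>ABR</td><td>Aberdeen Regional Airport</td><td>Aberdeen</td><td>SD</td><td>45.44906</td><td>-98.42183</td><td>USA</td></tr>\n",
+ "    <tr><th>4</th><td>ABY</td><td>Southwest Georgia Regional Airport</td><td>Albany</td><td>GA</td><td>31.53552</td><td>-84.19447</td><td>USA</td></tr>\n",
+ "  </tbody>\n",
+ "</table>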
\n", + "
" + ], + "text/plain": [ + " IATA_CODE AIRPORT CITY STATE LATITUDE \\\n", + "0 ABE Lehigh Valley International Airport Allentown PA 40.65236 \n", + "1 ABI Abilene Regional Airport Abilene TX 32.41132 \n", + "2 ABQ Albuquerque International Sunport Albuquerque NM 35.04022 \n", + "3 ABR Aberdeen Regional Airport Aberdeen SD 45.44906 \n", + "4 ABY Southwest Georgia Regional Airport Albany GA 31.53552 \n", + "\n", + " LONGITUDE COUNTRY \n", + "0 -75.44040 USA \n", + "1 -99.68190 USA \n", + "2 -106.60919 USA \n", + "3 -98.42183 USA \n", + "4 -84.19447 USA " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE', 'COUNTRY']\n", + "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")[columns]\n", + "airports_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "39722b5f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:471: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n", + " arrow_data = [[(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]\n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "\n", + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", + " entityId: 55EB4FC9F3D8AEE40442046F7B7EE92C\n", + " expectationDetails:\n", + " createRuleDetails:\n", + " - arguments:\n", + " column: IATA_CODE\n", + " levelType: ERROR\n", + " name: Rule-0\n", + " ruleType: expect_column_values_to_not_be_null\n", + " - arguments:\n", + " column: LATITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-1\n", + " ruleType: expect_column_values_to_be_between\n", + " - arguments:\n", + " column: LONGITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-2\n", + " ruleType: expect_column_values_to_be_between\n", + " expectationType: LENIENT\n", + " name: test_airports_df\n", + " validationEngineType: GREAT_EXPECTATIONS\n", + " featureStoreId: EA128EDAE4380286A842064AF466A685\n", + " id: 26DE61A551F8BF29F132FF03B62B3E67\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: AIRPORT\n", + " orderNumber: 2\n", + " - featureType: STRING\n", + " name: CITY\n", + " orderNumber: 3\n", + " - featureType: STRING\n", + " name: STATE\n", + " orderNumber: 4\n", + " - featureType: DOUBLE\n", + " name: LATITUDE\n", + " orderNumber: 5\n", + " - featureType: DOUBLE\n", + " name: LONGITUDE\n", + " orderNumber: 6\n", + " - 
featureType: STRING\n", + " name: COUNTRY\n", + " orderNumber: 7\n", + " isInferSchema: true\n", + " jobId: 9e11aebd-3ab1-4da3-a6dd-aa90bd1be5f7\n", + " name: airport_feature_group\n", + " outputFeatureDetails:\n", + " items:\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: IATA_CODE\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: AIRPORT\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: CITY\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: STATE\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: DOUBLE\n", + " name: LATITUDE\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: DOUBLE\n", + " name: LONGITUDE\n", + " primaryKeys:\n", + " items:\n", + " - name: IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: featureGroup" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airports.with_schema_details_from_dataframe(airports_df)\n", + "feature_group_airports.update()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "3ad0d743", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "09a87a20c8af48beaf230a58ee2b1609", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/16 [00:00 with error message: A schema mismatch detected when writing to the Delta table (Table ID: 020a3b36-917b-4fdc-890f-4fa27abdd809).\n", + "To enable schema migration using DataFrameWriter or DataStreamWriter, please set:\n", + 
"'.option(\"mergeSchema\", \"true\")'.\n", + "For other operations, set the session configuration\n", + "spark.databricks.delta.schema.autoMerge.enabled to \"true\". See the documentation\n", + "specific to the operation for details.\n", + "\n", + "Table schema:\n", + "root\n", + "-- IATA_CODE: string (nullable = true)\n", + "-- AIRPORT: string (nullable = true)\n", + "-- CITY: string (nullable = true)\n", + "-- STATE: string (nullable = true)\n", + "-- LATITUDE: double (nullable = true)\n", + "-- LONGITUDE: double (nullable = true)\n", + "\n", + "\n", + "Data schema:\n", + "root\n", + "-- IATA_CODE: string (nullable = true)\n", + "-- AIRPORT: string (nullable = true)\n", + "-- CITY: string (nullable = true)\n", + "-- STATE: string (nullable = true)\n", + "-- LATITUDE: double (nullable = true)\n", + "-- LONGITUDE: double (nullable = true)\n", + "-- COUNTRY: string (nullable = true)\n", + "\n", + " \n", + "To overwrite your schema or change partitioning, please set:\n", + "'.option(\"overwriteSchema\", \"true\")'.\n", + "\n", + "Note that the schema can't be overwritten when using\n", + "'replaceWhere'.\n", + " \n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡\n", + "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Failed │ A schema mismatch detected when writing to the Delta table (Table ID: 020a3b36-917b-4fdc-890f-4fa27abdd809). │\n", + "│ │ │ │ To enable schema migration using DataFrameWriter or DataStreamWriter, please set: │\n", + "│ │ │ │ '.option(\"mergeSchema\", \"true\")'. 
│\n", + "│ │ │ │ For other operations, set the session configuration │\n", + "│ │ │ │ spark.databricks.delta.schema.autoMerge.enabled to \"true\". See the documentation │\n", + "│ │ │ │ specific to the operation for details. │\n", + "│ │ │ │ │\n", + "│ │ │ │ Table schema: │\n", + "│ │ │ │ root │\n", + "│ │ │ │ -- IATA_CODE: string (nullable = true) │\n", + "│ │ │ │ -- AIRPORT: string (nullable = true) │\n", + "│ │ │ │ -- CITY: string (nullable = true) │\n", + "│ │ │ │ -- STATE: string (nullable = true) │\n", + "│ │ │ │ -- LATITUDE: double (nullable = true) │\n", + "│ │ │ │ -- LONGITUDE: double (nullable = true) │\n", + "│ │ │ │ │\n", + "│ │ │ │ │\n", + "│ │ │ │ Data schema: │\n", + "│ │ │ │ root │\n", + "│ │ │ │ -- IATA_CODE: string (nullable = true) │\n", + "│ │ │ │ -- AIRPORT: string (nullable = true) │\n", + "│ │ │ │ -- CITY: string (nullable = true) │\n", + "│ │ │ │ -- STATE: string (nullable = true) │\n", + "│ │ │ │ -- LATITUDE: double (nullable = true) │\n", + "│ │ │ │ -- LONGITUDE: double (nullable = true) │\n", + "│ │ │ │ -- COUNTRY: string (nullable = true) │\n", + "│ │ │ │ │\n", + "│ │ │ │ │\n", + "│ │ │ │ To overwrite your schema or change partitioning, please set: │\n", + "│ │ │ │ '.option(\"overwriteSchema\", \"true\")'. │\n", + "│ │ │ │ │\n", + "│ │ │ │ Note that the schema can't be overwritten when using │\n", + "│ │ │ │ 'replaceWhere'. │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════════════════════╛\n" + ] + } + ], + "source": [ + "feature_group_airports.materialise(airports_df)" + ] + }, + { + "cell_type": "markdown", + "id": "42495d16", + "metadata": {}, + "source": [ + "\n", + "### 3.4. Schema evolution\n", + "\n", + "Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that is changing over time. 
Most commonly, it's used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "8374d4c3", + "metadata": {}, + "outputs": [], + "source": [ + "from ads.feature_store.feature_option_details import FeatureOptionDetails\n", + "feature_option_details = FeatureOptionDetails().with_feature_option_write_config_details(merge_schema=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "31e59f5b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "a03758c89f9147d785de310b66f43c6c", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440f87b0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440f8cf0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440f8df0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440fef30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440fecf0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440fe7f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 
0x7f9d440fe170>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440fe7b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440fed30>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d440fee70>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d4410b170>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d4410b370>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d4410b270>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d4410b570>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d4410b5b0>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, 
config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 
322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + 
"INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + 
"INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, 
value: 8.603219124726667\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 
57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 
22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for 
rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.01567398119122254, 0.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 
0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ 
26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "feature_group_airports.materialise(\n", + " input_dataframe=airports_df,\n", + " feature_option_details=feature_option_details\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "a4a9d4fb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "\n", + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", + " entityId: 55EB4FC9F3D8AEE40442046F7B7EE92C\n", + " expectationDetails:\n", + " createRuleDetails:\n", + " - arguments:\n", + " column: IATA_CODE\n", + " levelType: ERROR\n", + " name: Rule-0\n", + " ruleType: expect_column_values_to_not_be_null\n", + " - arguments:\n", + " column: LATITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-1\n", + " ruleType: expect_column_values_to_be_between\n", + " - arguments:\n", + " column: LONGITUDE\n", + " max_value: 1.0\n", + " min_value: -1.0\n", + " levelType: ERROR\n", + " name: Rule-2\n", + " ruleType: expect_column_values_to_be_between\n", + " expectationType: LENIENT\n", + " name: test_airports_df\n", + " validationEngineType: GREAT_EXPECTATIONS\n", + " featureStoreId: EA128EDAE4380286A842064AF466A685\n", + " id: 26DE61A551F8BF29F132FF03B62B3E67\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: IATA_CODE\n", + " orderNumber: 1\n", + " - featureType: STRING\n", + " name: AIRPORT\n", + " orderNumber: 2\n", + " - featureType: STRING\n", + " name: CITY\n", + " orderNumber: 3\n", + " - featureType: STRING\n", + " name: STATE\n", + " orderNumber: 4\n", + " - featureType: DOUBLE\n", + " name: LATITUDE\n", + " orderNumber: 5\n", + " - featureType: DOUBLE\n", + " name: LONGITUDE\n", + " orderNumber: 6\n", + " - featureType: STRING\n", 
+ " name: COUNTRY\n", + " orderNumber: 7\n", + " isInferSchema: true\n", + " jobId: 6e6a6d07-6a8f-4ea4-8508-264054f4dfb5\n", + " name: airport_feature_group\n", + " outputFeatureDetails:\n", + " items:\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: IATA_CODE\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: AIRPORT\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: CITY\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: STATE\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: DOUBLE\n", + " name: LATITUDE\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: DOUBLE\n", + " name: LONGITUDE\n", + " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", + " featureType: STRING\n", + " name: COUNTRY\n", + " primaryKeys:\n", + " items:\n", + " - name: IATA_CODE\n", + " statisticsConfig:\n", + " isEnabled: true\n", + "type: featureGroup" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airports" + ] + }, + { + "cell_type": "markdown", + "id": "83082aa3", + "metadata": {}, + "source": [ + "\n", + "### 3.5. Ingestion modes" + ] + }, + { + "cell_type": "markdown", + "id": "4a3ea8b7", + "metadata": {}, + "source": [ + "\n", + "#### 3.5.1. Append\n", + "\n", + "In ``append`` mode, new data is added to the end of the existing table, extending the dataset. Existing records are neither modified nor deleted, so this mode suits scenarios where you continuously add new records."
+ ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "5983a241", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "af963256ff4d4bf9946faaa2f0229975", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440e05f0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440f5230>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44045070>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44045a30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440450f0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f629b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44045370>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43ea99b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), 
'4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d46ce1670>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d43f62ab0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d45082cb0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43ead3b0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d43ead730>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f595b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43ead6f0>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + 
"INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + 
"INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated 
cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.606550000000006\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.009404388714733543, 0.0, 0.01567398119122257, 0.009404388714733543, 0.0031347962382445166, 0.006269592476489026, 0.025078369905956112, 0.01567398119122257, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.07210031347962381, 
0.08777429467084635, 0.12225705329153613, 0.12852664576802508, 0.07836990595611282, 0.07523510971786829, 0.037617554858934255, 0.0, 0.0, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.00940438871473348, 0.0031347962382445305, 0.0, 0.0031347962382445305, 0.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [3, 0, 5, 3, 1, 2, 8, 5, 23, 25, 34, 23, 28, 39, 41, 25, 24, 12, 0, 0, 0, 4, 5, 4, 0, 3, 1, 0, 1, 0]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.56294, 'q2': 39.29761, 'q3': 43.16949}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.386920000000003\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': 
[-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.0, 0.003134796238244514, 0.003134796238244514, 0.009404388714733541, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.0031347962382445096, 0.006269592476489033, 0.006269592476489019, 0.00940438871473355, 0.006269592476489033, 0.0, 0.018808777429467072, 0.05642633228840126, 0.040752351097178674, 0.05015673981191224, 0.03448275862068967, 0.043887147335423204, 0.037617554858934144, 0.09090909090909094, 0.08463949843260188, 0.08777429467084641, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.00940438871473348, 0.006269592476489061]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, 
-157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [0, 1, 1, 3, 3, 5, 2, 1, 2, 2, 3, 2, 0, 6, 18, 13, 16, 11, 14, 12, 29, 27, 28, 32, 30, 26, 18, 9, 3, 2]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -110.94103, 'q2': -93.40307, 'q3': -82.55411}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "from ads.feature_store.feature_group_job import IngestionMode\n", + "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.APPEND)" 
+ ] + }, + { + "cell_type": "markdown", + "id": "443bb29e", + "metadata": {}, + "source": [ + "\n", + "#### 3.5.2. Overwrite\n", + "In ``overwrite`` mode, the existing table is dropped and recreated from the data being saved, so all previous records are discarded. Use this mode when you want to refresh the table completely with the latest data." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "0946e237", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "56d101b3aaf54a3c9b4d22624375673b", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44159a30>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d441301b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d441305f0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44130830>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44130270>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d442177f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44217cf0>)} sfc
map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44217770>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44217630>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d442176f0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44217ab0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44217db0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d44217030>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44141bb0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44141eb0>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc 
meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + 
"INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + 
"INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + 
"INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, 
value: 8.603219124726667\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 
57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 
22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", + "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for 
rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.01567398119122254, 0.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 
0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ 
26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "from ads.feature_store.feature_group_job import IngestionMode\n", + "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.OVERWRITE)" + ] + }, + { + "cell_type": "markdown", + "id": "818940f3", + "metadata": {}, + "source": [ + "\n", + "#### 3.5.3. Upsert\n", + "``Upsert`` mode, also known as ``merge`` mode, is used to update existing records in the table based on a primary key or a specified condition. If a record with the same key exists, it will be updated with the new data; otherwise, a new record will be inserted. This mode is useful for maintaining and synchronizing data between the source and destination tables while avoiding duplicates." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "f6cd567a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "0845dd8e0c53455abd4f484ad7661c90", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d442171f0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d4404bd30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44111cf0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440f8670>), 
'4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44111fb0>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44007230>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44038f70>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440d9530>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44159a70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d4404e170>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44141230>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44141a70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d44141630>)} sfc map\n", + "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d45cafc70>), '4cd1d3704778a196571a6c83581854cc': 
DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d442037b0>)} sfc map\n", + "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", + "INFO:mlm_insights.builder:Profile Generated Successfully\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, 
config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 
frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + 
"INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, 
lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", + 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated 
ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 
31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + 
"INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, 
value: False\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 
0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.01567398119122254, 0.0]}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", + "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", + "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", + "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", + "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) 
sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", + "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", + "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", + "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", + "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", + 
"INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", + "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", + "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", + "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", + "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", + "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", + "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", + "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" + ] + } + ], + "source": [ + "from ads.feature_store.feature_group_job import IngestionMode\n", + "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.UPSERT)" + ] + }, + { + "cell_type": "markdown", + "id": "edad9b57", + "metadata": {}, + "source": [ + "\n", + "### 3.6. History\n", + "You can call the ``history()`` method of the FeatureGroup instance to show history of the feature group." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "3e909d02", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", + " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/types.py:63: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pa.__version__) < LooseVersion(\"2.0.0\"):\n", + "\n", + "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", + " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", + "\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\"><th></th><th>version</th><th>timestamp</th><th>userId</th><th>userName</th><th>operation</th><th>operationParameters</th><th>job</th><th>notebook</th><th>clusterId</th><th>readVersion</th><th>isolationLevel</th><th>isBlindAppend</th><th>operationMetrics</th><th>userMetadata</th><th>engineInfo</th></tr>\n", + " </thead>\n", + " <tbody>\n", + "
 <tr><th>0</th><td>4</td><td>2023-07-25 10:11:15</td><td>None</td><td>None</td><td>MERGE</td><td>{'predicate': '(target_delta_table.IATA_CODE = source_delta_table.IATA_CODE)', 'matchedPredicates': '[{\"actionType\":\"update\"}]', 'notMatchedPredicates': '[{\"actionType\":\"insert\"}]'}</td><td>None</td><td>None</td><td>None</td><td>3.0</td><td>Serializable</td><td>False</td><td>{'numTargetRowsCopied': '0', 'numTargetRowsDeleted': '0', 'numTargetFilesAdded': '1', 'executionTimeMs': '5340', 'numTargetRowsInserted': '0', 'scanTimeMs': '2694', 'numTargetRowsUpdated': '322', 'numOutputRows': '322', 'numTargetChangeFilesAdded': '0', 'numSourceRows': '322', 'numTargetFilesRemoved': '2', 'rewriteTimeMs': '2443'}</td><td>None</td><td>Apache-Spark/3.2.1 Delta-Lake/2.0.1</td></tr>\n", + "
 <tr><th>1</th><td>3</td><td>2023-07-25 10:10:51</td><td>None</td><td>None</td><td>CREATE OR REPLACE TABLE AS SELECT</td><td>{'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'}</td><td>None</td><td>None</td><td>None</td><td>2.0</td><td>Serializable</td><td>False</td><td>{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'}</td><td>None</td><td>Apache-Spark/3.2.1 Delta-Lake/2.0.1</td></tr>\n", + "
 <tr><th>2</th><td>2</td><td>2023-07-25 10:10:28</td><td>None</td><td>None</td><td>WRITE</td><td>{'mode': 'Append', 'partitionBy': '[]'}</td><td>None</td><td>None</td><td>None</td><td>1.0</td><td>Serializable</td><td>True</td><td>{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'}</td><td>None</td><td>Apache-Spark/3.2.1 Delta-Lake/2.0.1</td></tr>\n", + "
 <tr><th>3</th><td>1</td><td>2023-07-25 10:10:13</td><td>None</td><td>None</td><td>CREATE OR REPLACE TABLE AS SELECT</td><td>{'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'}</td><td>None</td><td>None</td><td>None</td><td>0.0</td><td>Serializable</td><td>False</td><td>{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'}</td><td>None</td><td>Apache-Spark/3.2.1 Delta-Lake/2.0.1</td></tr>\n", + "
 <tr><th>4</th><td>0</td><td>2023-07-25 10:09:13</td><td>None</td><td>None</td><td>CREATE OR REPLACE TABLE AS SELECT</td><td>{'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'}</td><td>None</td><td>None</td><td>None</td><td>NaN</td><td>Serializable</td><td>False</td><td>{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20174'}</td><td>None</td><td>Apache-Spark/3.2.1 Delta-Lake/2.0.1</td></tr>\n", + "
 </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " version timestamp userId userName \\\n", + "0 4 2023-07-25 10:11:15 None None \n", + "1 3 2023-07-25 10:10:51 None None \n", + "2 2 2023-07-25 10:10:28 None None \n", + "3 1 2023-07-25 10:10:13 None None \n", + "4 0 2023-07-25 10:09:13 None None \n", + "\n", + " operation \\\n", + "0 MERGE \n", + "1 CREATE OR REPLACE TABLE AS SELECT \n", + "2 WRITE \n", + "3 CREATE OR REPLACE TABLE AS SELECT \n", + "4 CREATE OR REPLACE TABLE AS SELECT \n", + "\n", + " operationParameters \\\n", + "0 {'predicate': '(target_delta_table.IATA_CODE = source_delta_table.IATA_CODE)', 'matchedPredicates': '[{\"actionType\":\"update\"}]', 'notMatchedPredicates': '[{\"actionType\":\"insert\"}]'} \n", + "1 {'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'} \n", + "2 {'mode': 'Append', 'partitionBy': '[]'} \n", + "3 {'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'} \n", + "4 {'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'} \n", + "\n", + " job notebook clusterId readVersion isolationLevel isBlindAppend \\\n", + "0 None None None 3.0 Serializable False \n", + "1 None None None 2.0 Serializable False \n", + "2 None None None 1.0 Serializable True \n", + "3 None None None 0.0 Serializable False \n", + "4 None None None NaN Serializable False \n", + "\n", + " operationMetrics \\\n", + "0 {'numTargetRowsCopied': '0', 'numTargetRowsDeleted': '0', 'numTargetFilesAdded': '1', 'executionTimeMs': '5340', 'numTargetRowsInserted': '0', 'scanTimeMs': '2694', 'numTargetRowsUpdated': '322', 'numOutputRows': '322', 'numTargetChangeFilesAdded': '0', 'numSourceRows': '322', 'numTargetFilesRemoved': '2', 'rewriteTimeMs': '2443'} \n", + "1 {'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'} \n", + "2 {'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'} \n", + "3 {'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'} \n", + "4 {'numFiles': 
'2', 'numOutputRows': '322', 'numOutputBytes': '20174'} \n", + "\n", + " userMetadata engineInfo \n", + "0 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", + "1 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", + "2 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", + "3 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", + "4 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 " + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "feature_group_airports.history().toPandas()" + ] + }, + { + "cell_type": "markdown", + "id": "dd5de81e", + "metadata": {}, + "source": [ + "\n", + "### 3.7. Preview\n", + "\n", + "You can call the ``preview()`` method of the FeatureGroup instance to preview the feature group.\n", + "\n", + "The ``.preview()`` method takes the following optional parameters:\n", + "\n", + "- timestamp: date-time. Commit timestamp for the feature group\n", + "- version_number: int. Version number for the feature group\n", + "- row_count: int. Defaults to 10. 
Total number of rows to return" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "d706e9da", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+--------------------+-------------+-----+--------+----------+-------+\n", + "|IATA_CODE| AIRPORT| CITY|STATE|LATITUDE| LONGITUDE|COUNTRY|\n", + "+---------+--------------------+-------------+-----+--------+----------+-------+\n", + "| ABE|Lehigh Valley Int...| Allentown| PA|40.65236| -75.4404| USA|\n", + "| ABI|Abilene Regional ...| Abilene| TX|32.41132| -99.6819| USA|\n", + "| ABQ|Albuquerque Inter...| Albuquerque| NM|35.04022|-106.60919| USA|\n", + "| ABR|Aberdeen Regional...| Aberdeen| SD|45.44906| -98.42183| USA|\n", + "| ABY|Southwest Georgia...| Albany| GA|31.53552| -84.19447| USA|\n", + "| ACK|Nantucket Memoria...| Nantucket| MA|41.25305| -70.06018| USA|\n", + "| ACT|Waco Regional Air...| Waco| TX|31.61129| -97.23052| USA|\n", + "| ACV| Arcata Airport|Arcata/Eureka| CA|40.97812|-124.10862| USA|\n", + "| ACY|Atlantic City Int...|Atlantic City| NJ|39.45758| -74.57717| USA|\n", + "| ADK| Adak Airport| Adak| AK|51.87796|-176.64603| USA|\n", + "+---------+--------------------+-------------+-----+--------+----------+-------+\n", + "\n" + ] + } + ], + "source": [ + "feature_group_airports.preview().show()" + ] + }, + { + "cell_type": "markdown", + "id": "9881408a", + "metadata": {}, + "source": [ + "\n", + "# References\n", + "\n", + "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", + "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", + "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", + "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)" + ] + }, + { + "cell_type": "code", 
+ "execution_count": null, + "id": "972dfb03", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]", + "language": "python", + "name": "conda-env-fspyspark32_p38_cpu_v1-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebook_examples/feature_store_spark_magic.ipynb b/notebook_examples/feature_store_spark_magic.ipynb new file mode 100644 index 00000000..d4d19b6b --- /dev/null +++ b/notebook_examples/feature_store_spark_magic.ipynb @@ -0,0 +1,1017 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "f10693dc", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "Oracle Data Science service sample notebook.\n", + "\n", + "Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", + "\n", + "***\n", + "\n", + "# Data Flow Studio: Big Data Operations in Feature Store\n", + "

by the Oracle Cloud Infrastructure Data Science Service.

\n", + "\n", + "---\n", + "# Overview:\n", + "\n", + "This notebook demonstrates how to run interactive Spark workloads on a long-lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** is used for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. It includes a set of magic commands for interactively running Spark code.\n", + "\n", + "\n", + "\n", + "## Contents:\n", + "\n", + "- 1. Introduction\n", + "- 2. Pre-requisites\n", + " - 2.1 Policies\n", + " - 2.2 Prerequisites Helpers\n", + " - 2.3 Authentication\n", + " - 2.4 Variables\n", + "- 3. Dataflow Magic\n", + " - 3.1. Load extension\n", + " - 3.2. Load feature groups\n", + " - 3.3. Data exploration\n", + " - 3.4. Creation of logical entities of feature group\n", + " - 3.4.1 Creation of feature store\n", + " - 3.4.2 Creation of entity\n", + " - 3.4.3 Creation of feature group\n", + " - 3.4.4 Materialisation of feature group\n", + " - 3.4.5 Querying of feature group\n", + "- 4. References\n", + "\n", + "---\n", + "\n", + "\n", + "Compatible conda pack: [PySpark 3.2 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n", + "\n", + "\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "id": "f616fcc9", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 1. Introduction\n", + "\n", + "Oracle feature store is a stack-based solution that is deployed in the customer enclave using OCI resource manager. Customers can stand up the service with infrastructure in their own tenancy. 
The service consists of APIs that are deployed in the customer tenancy using resource manager.\n", + "\n", + "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "\n", + "\n", + "* **Feature Vector**: A set of feature values for any one primary or identifier key. For example, all or a subset of the features of customer ID ‘2536’ can be called one feature vector.\n", + "\n", + "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", + "\n", + "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically take when accessing the feature store service is to list the entities and their associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc.\n", + "\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ML models. It serves as an organizational unit within the feature store for users to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering.\n", + "\n", + "* **Feature Group Job**: A feature group job is the execution instance of a feature group. Each feature group job includes validation results and statistics results.\n", + "\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", + "\n", + "* **Dataset Job**: A dataset job is the execution instance of a dataset. Each dataset job includes validation results and statistics results." 
+ ] + }, + { + "cell_type": "markdown", + "id": "ca672c4d", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 2. Pre-requisites \n", + "\n", + "Data Flow Sessions are accessible through the following conda environment: \n", + "\n", + "* **PySpark 3.2 and Feature Store (pyspark_3_v1)**" + ] + }, + { + "cell_type": "markdown", + "id": "33daeebe", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "## 2.1. Policies\n", + "This section covers the creation of dynamic groups and policies needed to use the service.\n", + "\n", + "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)\n", + "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n", + "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)\n", + "* [Data Catalog Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)" + ] + }, + { + "cell_type": "markdown", + "id": "97ed352c", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "## 2.2. Helpers\n", + "This section provides a helper function used across the notebook to prepare arguments for the magic commands. This function is particularly useful when you want to pass Python variables as arguments to the Spark magic commands." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "6a5e9194", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import json\n", + "\n", + "\n", + "def prepare_command(command: dict) -> str:\n", + " \"\"\"Converts a command dictionary to a single-quoted JSON string.\"\"\"\n", + " return f\"'{json.dumps(command)}'\"" + ] + }, + { + "cell_type": "markdown", + "id": "9c0484c6", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "## 2.3. 
Authentication\n", + "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the Data Flow Session Spark cluster.
\n", + "To set up authentication, use ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "1e6f441d", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import ads\n", + "\n", + "ads.set_auth(\"resource_principal\") # Supported values: resource_principal, api_key" + ] + }, + { + "cell_type": "markdown", + "id": "86735a35", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "## 2.4. Variables\n", + "To run this notebook, you must provide some information about your tenancy configuration. To connect to the HIVE metastore, replace `` with the OCID for the HIVE metastore. Connecting to the metastore is optional. \n", + "\n", + "To create and run a Data Flow session, you must specify a ``, ``, bucket `` and `` for storing logs. These resources must be in the same compartment." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "276d1aec", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "compartment_id = \"\"\n", + "metastore_id = \"\"\n", + "logs_bucket_uri = \"\"\n", + "\n", + "custom_conda_environment_uri = \"oci://service-conda-packs-fs@bigdatadatasciencelarge/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda\"" + ] + }, + { + "cell_type": "markdown", + "id": "3fbc6c00", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# 3. Data Flow Spark Magic\n", + "Data Flow Spark Magic commands allow you to interactively work with Data Flow Spark clusters (sessions) in Jupyter notebooks through the Livy REST API. It provides a set of Jupyter Notebook cell magic commands to turn Jupyter into an integrated Spark development environment for remote clusters. 
\n", + "\n", + "**Data Flow Magic allows you to:**\n", + "\n", + "* Run Spark code against Data Flow remote Spark cluster\n", + "* Create a Data Flow Spark Session with SparkContext and HiveContext against Data Flow remote Spark cluster\n", + "* Capture the output of Spark queries as a local Pandas data frame to interact easily with other Python libraries (e.g. matplotlib)" + ] + }, + { + "cell_type": "markdown", + "id": "591d4492", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.1. Load Spark Magic Commands and Getting Help\n", + "Data Flow Spark Magic is a JupyterLab extension that you need to activate in your notebook using the `%load_ext dataflow.magics` magic command.
\n", + "After the extension is activated, the `%help` command can be used to get the list of supported commands." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "a6e0890f", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "%load_ext dataflow.magics" + ] + }, + { + "cell_type": "markdown", + "id": "b39ac865", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.2. Create Session\n", + "To create a new Data Flow cluster session use the `%create_session` magic command." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "23c6a9e2", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Setting up the Cluster..\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "56a9bba76fb7424ea6a7bc207a085508", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Cluster is ready..\n", + "Starting Spark application..\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
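<table>\n", + "<tr><th>Session ID</th><th>Kind</th><th>State</th><th>Current session</th></tr>\n", + "<tr><td>ocid1.dataflowapplication.oc1.iad.anuwcljsnif7xwia5uvy54rp5ybm2u2va6sg2azmpmtsw4i7s2wpqy3thj3a</td><td>pyspark</td><td>IN_PROGRESS</td><td>Dataflow Run</td></tr>\n", + "</table>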
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "SparkSession available as 'spark'.\n", + "SparkContext available as 'sc'.\n" + ] + } + ], + "source": [ + "command = prepare_command(\n", + " {\n", + " \"compartmentId\": compartment_id,\n", + " \"displayName\": \"spark_session_via_notebook\",\n", + " \"language\": \"PYTHON\",\n", + " \"sparkVersion\": \"3.2.1\",\n", + " \"numExecutors\": 8,\n", + " \"metastoreId\": metastore_id,\n", + " \"driverShape\": \"VM.Standard2.1\",\n", + " \"executorShape\": \"VM.Standard2.1\",\n", + " \"driverShapeConfig\": {\"ocpus\": 2, \"memoryInGBs\": 16},\n", + " \"executorShapeConfig\": {\"ocpus\": 2, \"memoryInGBs\": 16},\n", + " \"type\": \"SESSION\",\n", + " \"logsBucketUri\": logs_bucket_uri,\n", + " \"configuration\": {\n", + " \"spark.archives\": custom_conda_environment_uri,\n", + " \"fs.oci.client.hostname\": \"https://objectstorage.us-ashburn-1.oraclecloud.com\"\n", + " },\n", + " }\n", + ")\n", + "\n", + "%create_session -l python -c $command" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "53a7b300", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "%%spark\n", + "from great_expectations.core import 
ExpectationSuite, ExpectationConfiguration\n", + "\n", + "import ads\n", + "from ads.feature_store.entity import Entity\n", + "from ads.feature_store.feature_group import FeatureGroup\n", + "from ads.feature_store.feature_group_expectation import ExpectationType\n", + "from ads.feature_store.feature_store import FeatureStore\n", + "from ads.feature_store.input_feature_detail import FeatureDetail, FeatureType\n", + "from ads.feature_store.statistics_config import StatisticsConfig\n", + "from ads.feature_store.transformation import TransformationMode\n", + "import os\n", + "\n", + "# Set the Authentications for the feature store operations\n", + "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"service_endpoint\": \"https://pac7vnpvfa2xkagazweggatqwy.apigateway.us-ashburn-1.oci.customer-oci.com/20230101\"})\n", + "\n", + "# Variables\n", + "compartment_id = \"\"\n", + "metastore_id = \"\"" + ] + }, + { + "cell_type": "markdown", + "id": "6824f08f", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.3. 
Data exploration" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "eabdb503", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+-------------------+-------------------+\n", + "|vendor_id| pickup_at| dropoff_at|\n", + "+---------+-------------------+-------------------+\n", + "| CMT|2011-01-29 02:38:35|2011-01-29 02:47:07|\n", + "| CMT|2011-01-28 10:38:19|2011-01-28 10:42:18|\n", + "| CMT|2011-01-28 23:49:58|2011-01-28 23:57:44|\n", + "| CMT|2011-01-28 23:52:09|2011-01-28 23:59:21|\n", + "| CMT|2011-01-28 10:34:39|2011-01-28 11:25:50|\n", + "| CMT|2011-01-28 23:50:00|2011-01-28 23:58:11|\n", + "| CMT|2011-01-29 02:38:48|2011-01-29 02:50:37|\n", + "| CMT|2011-01-29 02:41:16|2011-01-29 02:45:45|\n", + "| CMT|2011-01-28 23:50:51|2011-01-29 00:07:55|\n", + "| CMT|2011-01-29 02:41:34|2011-01-29 03:08:14|\n", + "| CMT|2011-01-28 23:50:22|2011-01-29 00:03:23|\n", + "| CMT|2011-01-29 02:40:30|2011-01-29 02:43:08|\n", + "| CMT|2011-01-29 02:42:47|2011-01-29 02:50:31|\n", + "| CMT|2011-01-28 23:51:10|2011-01-29 00:03:19|\n", + "| CMT|2011-01-28 05:07:16|2011-01-28 05:12:25|\n", + "| CMT|2011-01-29 02:42:31|2011-01-29 02:55:56|\n", + "| CMT|2011-01-28 23:51:01|2011-01-28 23:59:06|\n", + "| CMT|2011-01-29 02:39:23|2011-01-29 02:59:31|\n", + "| CMT|2011-01-29 02:41:18|2011-01-29 02:50:43|\n", + "| CMT|2011-01-28 10:30:44|2011-01-28 10:48:05|\n", + "+---------+-------------------+-------------------+\n", + "only showing top 20 rows" + ] + } + ], + "source": [ + "%%spark\n", + "df_nyc_tlc = 
spark.read.parquet(\"oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet\", header=False, inferSchema=True)\n", + "df_nyc_tlc = df_nyc_tlc.select(\"vendor_id\", \"pickup_at\", \"dropoff_at\")\n", + "\n", + "df_nyc_tlc.show()" + ] + }, + { + "cell_type": "markdown", + "id": "d5e06db4", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "### 3.4. Create feature store logical entities" + ] + }, + { + "cell_type": "markdown", + "id": "c8e0ce2e", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "#### 3.4.1 Creation of Feature Store\n", + "Feature store is the top level entity for feature store service" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "7228e930", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "kind: featurestore\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " description: Feature Store Description\n", + " displayName: FeatureStore\n", + " id: 8893420628AB925DBEF259F660862F31\n", + " offlineConfig:\n", + " metastoreId: ocid1.datacatalogmetastore.oc1.iad.amaaaaaanif7xwiaavhd2liaebamr3tbjzio3uw2lxuteoa5ejsfvhqufbsa\n", + "type: featureStore" + ] + } + ], + "source": [ + "%%spark\n", + "feature_store_resource = FeatureStore(). \\\n", + " with_description(\"Feature Store Description\"). \\\n", + " with_compartment_id(compartment_id). \\\n", + " with_display_name(\"FeatureStore\"). 
\\\n", + " with_offline_config(metastore_id=metastore_id)\n", + "\n", + "feature_store = feature_store_resource.create()\n", + "feature_store" + ] + }, + { + "cell_type": "markdown", + "id": "a805da11", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "#### 3.4.2 Creation of Entity\n", + "An entity is a group of semantically related features." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "84f611d7", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "kind: entity\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " featureStoreId: 8893420628AB925DBEF259F660862F31\n", + " id: 5748B756C5CEE21176FCCDFDB64FA08F\n", + " name: entity_resource-sticky-salmon-2023-07-14-05:46.01\n", + "type: entity" + ] + } + ], + "source": [ + "%%spark\n", + "entity = feature_store.create_entity()\n", + "entity" + ] + }, + { + "cell_type": "markdown", + "id": "4ccacb09", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "#### 3.4.3 Creation of Feature group\n", + "A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "d58b0569", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "kind: FeatureGroup\n", + "spec:\n", + " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", + " entityId: 5748B756C5CEE21176FCCDFDB64FA08F\n", + " expectationDetails:\n", + " createRuleDetails:\n", + " - arguments:\n", + " column: vendor_id\n", + " levelType: ERROR\n", + " name: Rule-0\n", + " ruleType: EXPECT_COLUMN_VALUES_TO_NOT_BE_NULL\n", + " expectationType: LENIENT\n", + " name: feature_definition\n", + " validationEngineType: GREAT_EXPECTATIONS\n", + " featureStoreId: 8893420628AB925DBEF259F660862F31\n", + " id: 6BAC94626CABC8944E7C29F5D9C8FC5E\n", + " inputFeatureDetails:\n", + " - featureType: STRING\n", + " name: vendor_id\n", + " orderNumber: 1\n", + " - featureType: TIMESTAMP\n", + " name: pickup_at\n", + " orderNumber: 2\n", + " - featureType: TIMESTAMP\n", + " name: dropoff_at\n", + " orderNumber: 3\n", + " isInferSchema: false\n", + " name: feature_group_big_data\n", + " primaryKeys:\n", + " items:\n", + " - name: vendor_id\n", + " statisticsConfig:\n", + " isEnabled: false\n", + "type: featureGroup" + ] + } + ], + "source": [ + "%%spark\n", + "\n", + "# Initialize Expectation Suite\n", + "expectation_suite_trans = ExpectationSuite(expectation_suite_name=\"feature_definition\")\n", + "expectation_suite_trans.add_expectation(\n", + " ExpectationConfiguration(\n", + " expectation_type=\"EXPECT_COLUMN_VALUES_TO_NOT_BE_NULL\",\n", + " kwargs={\"column\": 
\"vendor_id\"}\n", + " )\n", + ")\n", + "\n", + "stats_config = StatisticsConfig().with_is_enabled(False)\n", + "\n", + "feature_group = entity.create_feature_group(\n", + " primary_keys=[\"vendor_id\"],\n", + " schema_details_dataframe=df_nyc_tlc, #infer the schema from the data frame\n", + " expectation_suite=expectation_suite_trans,\n", + " expectation_type=ExpectationType.LENIENT,\n", + " statistics_config=stats_config,\n", + " name=\"feature_group_big_data\",\n", + ")\n", + "\n", + "feature_group" + ] + }, + { + "cell_type": "markdown", + "id": "76f62d36", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "#### 3.4.4 Materialisation of Feature group" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "807e843c", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Calculating Metrics: 100%|##########| 8/8 [01:04<00:00, 8.12s/it]" + ] + } + ], + "source": [ + "%%spark\n", + "import pandas as pd\n", + "df_nyc_tlc = spark.read.parquet(\"oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet\", header=False, inferSchema=True)\n", + "df_nyc_tlc = df_nyc_tlc.select(\"vendor_id\", \"pickup_at\", \"dropoff_at\").limit(1000)\n", + "\n", + "feature_group.materialise(df_nyc_tlc)" + ] + }, + { + "cell_type": "markdown", + "id": "c3fe60de", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "#### 3.4.5 Feature group Querying" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "8363e1ea", + "metadata": { + "pycharm": { + 
"name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+-------------------+-------------------+\n", + "|vendor_id| pickup_at| dropoff_at|\n", + "+---------+-------------------+-------------------+\n", + "| VTS|2011-02-27 04:00:00|2011-02-27 04:14:00|\n", + "| VTS|2011-02-27 20:38:00|2011-02-27 20:46:00|\n", + "| VTS|2011-02-27 17:47:00|2011-02-27 17:58:00|\n", + "| VTS|2011-02-26 19:56:00|2011-02-26 20:04:00|\n", + "| VTS|2011-02-23 13:05:00|2011-02-23 13:10:00|\n", + "| VTS|2011-02-27 03:48:00|2011-02-27 04:01:00|\n", + "| VTS|2011-02-27 17:52:00|2011-02-27 18:02:00|\n", + "| VTS|2011-02-27 00:44:00|2011-02-27 01:04:00|\n", + "| VTS|2011-02-27 04:08:00|2011-02-27 04:22:00|\n", + "| VTS|2011-02-27 11:53:00|2011-02-27 12:05:00|\n", + "+---------+-------------------+-------------------+\n", + "only showing top 10 rows" + ] + } + ], + "source": [ + "%%spark\n", + "feature_group.select().show()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "a992899c", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+-------------------+\n", + "|vendor_id| pickup_at|\n", + "+---------+-------------------+\n", + "| VTS|2011-02-27 04:00:00|\n", + "| 
VTS|2011-02-27 20:38:00|\n", + "| VTS|2011-02-27 17:47:00|\n", + "| VTS|2011-02-26 19:56:00|\n", + "| VTS|2011-02-23 13:05:00|\n", + "| VTS|2011-02-27 03:48:00|\n", + "| VTS|2011-02-27 17:52:00|\n", + "| VTS|2011-02-27 00:44:00|\n", + "| VTS|2011-02-27 04:08:00|\n", + "| VTS|2011-02-27 11:53:00|\n", + "+---------+-------------------+\n", + "only showing top 10 rows" + ] + } + ], + "source": [ + "%%spark\n", + "feature_group.select([\"vendor_id\", \"pickup_at\"]).show()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "aaf454e6", + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+---------+---------+----------+\n", + "|vendor_id|pickup_at|dropoff_at|\n", + "+---------+---------+----------+\n", + "+---------+---------+----------+" + ] + } + ], + "source": [ + "%%spark\n", + "feature_group.filter(feature_group.vendor_id == \"CMT\").show()" + ] + }, + { + "cell_type": "markdown", + "id": "5bd5b8e6", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "\n", + "# References\n", + "\n", + "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", + "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", + "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", + "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd4d0acc", + "metadata": { + "pycharm": { + "name": "#%%\n" 
+ } + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [conda env:fspyspark32_p38_cpu#conda_v1]", + "language": "python", + "name": "conda-env-fspyspark32_p38_cpu_conda_v1-py" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.17" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From e413c0b8da43c75ccf06d4bf67408c91f96273ca Mon Sep 17 00:00:00 2001 From: najiyacl Date: Tue, 10 Oct 2023 17:53:09 +0530 Subject: [PATCH 2/3] Adding reference in index.json,Updating the release notebooks --- .../feature_store_querying.ipynb | 4549 +---------------- .../feature_store_quickstart.ipynb | 1904 ++----- .../feature_store_schema_evolution.ipynb | 3088 +---------- .../feature_store_spark_magic.ipynb | 561 +- notebook_examples/index.json | 69 + 5 files changed, 1014 insertions(+), 9157 deletions(-) diff --git a/notebook_examples/feature_store_querying.ipynb b/notebook_examples/feature_store_querying.ipynb index f72d6772..771aa10e 100644 --- a/notebook_examples/feature_store_querying.ipynb +++ b/notebook_examples/feature_store_querying.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "raw", - "id": "7e04d02d", + "id": "a5f5a0ea", "metadata": { "pycharm": { "name": "#%% raw\n" @@ -12,7 +12,7 @@ "qweews@notebook{feature_store-querying.ipynb,\n", " title: Using feature store for feature querying using pandas like interface for query and join,\n", " summary: Feature store quickstart guide to perform feature querying using pandas like interface for query and join.,\n", - " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " developed_on: fspyspark32_p38_cpu_v1,\n", " keywords: feature store, querying,\n", " license: Universal Permissive License v 1.0\n", "}" @@ -21,7 +21,7 @@ { "cell_type": "code", "execution_count": null, - 
"id": "3d325ddb", + "id": "983875a7", "metadata": { "ExecuteTime": { "end_time": "2023-05-24T08:26:08.572567Z", @@ -34,25 +34,12 @@ "outputs": [], "source": [ "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", - "\n", - "!odsc conda install --uri https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/service-conda-packs-fs/o/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda" + "!pip install --pre --no-deps oracle-ads==2.9.0rc0" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "544cf0fe", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", - "id": "eff8a822", + "id": "3beb360a", "metadata": { "pycharm": { "name": "#%% md\n" @@ -71,26 +58,28 @@ "---\n", "# Overview:\n", "---\n", - "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps defines a standardised pipeline for ingestion of data and querying of data. This notebook demonstrates how to use feature store within a long lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster.\n", + "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. 
Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps define a standardised pipeline for ingestion of data and querying of data.\n", + "\n", - "Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n", + "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n", "\n", "## Contents:\n", "\n", "- 1. Introduction\n", - "- 1. Pre-requisites\n", - " - 2.1 Policies\n", - " - 2.2 Authentication\n", - " - 2.3 Variables\n", + "- 2. Pre-requisites to Running this Notebook\n", + " - 2.1 Setup\n", + " - 2.2 Policies\n", + " - 2.3 Authentication\n", + " - 2.4 Variables\n", "- 3. Feature store querying\n", " - 3.1. Exploration of data in feature store\n", - " - 3.2. Load feature groups\n", + " - 3.2. Create feature store logical entities\n", " - 3.3. Explore feature groups\n", " - 3.4. Select subset of features\n", " - 3.5. Filter feature groups\n", " - 3.6. Apply joins on feature group\n", " - 3.7. Create dataset from multiple or one feature group\n", " - 3.8 Free form sql query\n", + " - 3.9 Feature store Entities using YAML\n", "- 4. References\n", "\n", "---\n", @@ -104,7 +93,7 @@ }, { "cell_type": "markdown", - "id": "208425ef", + "id": "56dc5982", "metadata": { "pycharm": { "name": "#%% md\n" @@ -136,7 +125,7 @@ }, { "cell_type": "markdown", - "id": "0bb56df6", + "id": "6faf8c9a", "metadata": { "pycharm": { "name": "#%% md\n" @@ -144,24 +133,27 @@ }, "source": [ "\n", - "# 2. 
Pre-requisites to Running this Notebook\n", "\n", - "Data Flow Sessions are accessible through the following conda environment:\n", + "Notebook Sessions are accessible through the following conda environment: \n", "\n", - "* **PySpark 3.2, Feature store 1.0 and Data Flow 1.0 (fs_pyspark32_p38_cpu_v1)**\n", + "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", "\n", - "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore id of hive metastore is tied to feature store construct of feature store service.\n" + "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. \n" ] }, { "cell_type": "markdown", - "id": "5669e712", + "id": "5de2b05e", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ + "\n", + "### 2.1. Setup\n", + "\n", "\n", "### `spark-defaults.conf`\n", "\n", @@ -184,17 +176,12 @@ "\n", "```bash\n", "odsc data-catalog config --help\n", - "```\n", - "\n", - "\n", - "### Session Setup\n", - "\n", - "The notebook makes connections to the Data Catalog metastore and Object Storage. In the next cell, specify the bucket URI to act as the data warehouse. Use the `warehouse_uri` variable with the `oci://@/` format. Update the variable `metastore_id` with the OCID of the Data Catalog metastore." + "```" ] }, { "cell_type": "markdown", - "id": "e0977c6c", + "id": "79215ead", "metadata": { "pycharm": { "name": "#%% md\n" @@ -202,10 +189,10 @@ }, "source": [ "\n", - "### 2.1. Policies\n", + "### 2.2. 
Policies\n", "This section covers the creation of dynamic groups and policies needed to use the service.\n", "\n", - "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)\n", + "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm)\n", "* [Data Catalog Metastore Required Policies](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)\n", "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n", "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)" @@ -213,7 +200,7 @@ }, { "cell_type": "markdown", - "id": "455ddd75", + "id": "b8ba35e1", "metadata": { "pycharm": { "name": "#%% md\n" @@ -221,15 +208,15 @@ }, "source": [ "\n", - "### 2.2. Authentication\n", + "### 2.3. Authentication\n", "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook cluster.
\n", "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```." ] }, { "cell_type": "code", - "execution_count": 1, - "id": "964842e8", + "execution_count": null, + "id": "ec734e55", "metadata": { "ExecuteTime": { "start_time": "2023-05-24T08:26:08.577504Z" @@ -243,12 +230,12 @@ "outputs": [], "source": [ "import ads\n", - "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"service_endpoint\": \"\"})" + "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"https://{api_gateway}/20230101\"})" ] }, { "cell_type": "markdown", - "id": "3eeb7367", + "id": "68ed4943", "metadata": { "pycharm": { "name": "#%% md\n" @@ -256,14 +243,14 @@ }, "source": [ "\n", - "### 2.3. Variables\n", + "### 2.4. Variables\n", "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for offline feature store." ] }, { "cell_type": "code", - "execution_count": 2, - "id": "8471ee05", + "execution_count": null, + "id": "e6173268", "metadata": { "pycharm": { "is_executing": true, @@ -274,13 +261,13 @@ "source": [ "import os\n", "\n", - "compartment_id = \"\"\n", + "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n", "metastore_id = \"\"" ] }, { "cell_type": "markdown", - "id": "4bcfeb4c", + "id": "18669545", "metadata": { "pycharm": { "name": "#%% md\n" @@ -288,14 +275,14 @@ }, "source": [ "\n", - "# 3. Feature group querying\n", - "By default the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) and [deeque](https://github.com/awslabs/deequ) libraries. The joining functionality is heavily inspired by the APIs used by Pandas to merge, join or filter DataFrames. 
The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions." + "# 3. Feature store querying\n", + "By default the **PySpark 3.2 and Feature Store Python 3.8** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) library. In an ADS feature store module, you can either use the Python programmatic or YAML interface to define feature store entities. The joining functionality is heavily inspired by the APIs used by Pandas to merge, join or filter DataFrames. The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions." ] }, { "cell_type": "code", - "execution_count": 3, - "id": "b46d9ca9", + "execution_count": null, + "id": "7f696caa", "metadata": { "pycharm": { "is_executing": true, @@ -311,68 +298,15 @@ }, { "cell_type": "code", - "execution_count": 4, - "id": "ef297a89", + "execution_count": null, + "id": "5e31f620", "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/ads/model/deployment/model_deployment.py:48: DeprecationWarning: The `ads.model.deployment.model_deployment_properties` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeploymentInfrastructure` and `ModelDeploymentRuntime` classes in `ads.model.deployment` module for configuring model deployment. 
Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", - " from .model_deployment_properties import ModelDeploymentProperties\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/ads/model/deployment/__init__.py:7: DeprecationWarning: The `ads.model.deployment.model_deployer` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeployment` class in `ads.model.deployment` module for initializing and deploying model deployment. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", - " from .model_deployer import ModelDeployer\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/__init__.py:44: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " LooseVersion(pyarrow.__version__) >= LooseVersion(\"2.0.0\")\n", - "\n", - "WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. 
pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/frame.py:62: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) >= LooseVersion(\"0.24\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/frame.py:81: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:85: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:191: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/missing/series.py:89: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/pandas/groupby.py:50: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) >= LooseVersion(\"1.3.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/fs/__init__.py:4: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", - "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", - " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/fs/opener/__init__.py:6: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs.opener')`.\n", - "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", - " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pkg_resources/__init__.py:2349: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", - "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. 
See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", - " declare_namespace(parent)\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "import pandas as pd\n", "from ads.feature_store.feature_store import FeatureStore\n", @@ -388,7 +322,7 @@ }, { "cell_type": "markdown", - "id": "d01c13f1", + "id": "0a2dd067", "metadata": { "pycharm": { "name": "#%% md\n" @@ -401,136 +335,15 @@ }, { "cell_type": "code", - "execution_count": 5, - "id": "b8d4a31d", + "execution_count": null, + "id": "1989eb8d", "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/tmp/ipykernel_4349/906484602.py:1: DtypeWarning: Columns (7,8) have mixed types. Specify dtype option on import or set low_memory=False.\n", - " flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", - "\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
" - ], - "text/plain": [ - " YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT \\\n", - "0 2015 1 1 4 AS 98 ANC \n", - "1 2015 1 1 4 AA 2336 LAX \n", - "2 2015 1 1 4 US 840 SFO \n", - "3 2015 1 1 4 AA 258 LAX \n", - "4 2015 1 1 4 AS 135 SEA \n", - "\n", - " DESTINATION_AIRPORT \n", - "0 SEA \n", - "1 PBI \n", - "2 CLT \n", - "3 MIA \n", - "4 ANC " - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", "flights_df = flights_df.head(100)\n", @@ -539,121 +352,15 @@ }, { "cell_type": "code", - "execution_count": 6, - "id": "0263f6a7", + "execution_count": null, + "id": "d1ddca21", "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
" - ], - "text/plain": [ - " IATA_CODE AIRPORT CITY STATE COUNTRY \\\n", - "0 ABE Lehigh Valley International Airport Allentown PA USA \n", - "1 ABI Abilene Regional Airport Abilene TX USA \n", - "2 ABQ Albuquerque International Sunport Albuquerque NM USA \n", - "3 ABR Aberdeen Regional Airport Aberdeen SD USA \n", - "4 ABY Southwest Georgia Regional Airport Albany GA USA \n", - "\n", - " LATITUDE LONGITUDE \n", - "0 40.65236 -75.44040 \n", - "1 32.41132 -99.68190 \n", - "2 35.04022 -106.60919 \n", - "3 45.44906 -98.42183 \n", - "4 31.53552 -84.19447 " - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")\n", "airports_df.head()" @@ -661,84 +368,15 @@ }, { "cell_type": "code", - "execution_count": 7, - "id": "bfac65f4", + "execution_count": null, + "id": "da859a88", "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
" - ], - "text/plain": [ - " IATA_CODE AIRLINE\n", - "0 UA United Air Lines Inc.\n", - "1 AA American Airlines Inc.\n", - "2 US US Airways Inc.\n", - "3 F9 Frontier Airlines Inc.\n", - "4 B6 JetBlue Airways" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "airlines_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airlines.csv\")\n", "airlines_df.head()" @@ -746,7 +384,7 @@ }, { "cell_type": "markdown", - "id": "88a21cff", + "id": "ac4e1264", "metadata": { "pycharm": { "name": "#%% md\n" @@ -759,21 +397,24 @@ }, { "cell_type": "markdown", - "id": "789489e5", + "id": "b4c78551", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ + "\n", "#### 3.2.1 Feature Store\n", - "Feature store is the top level entity for feature store service" + "Feature store is the top level entity for feature store service\n", + "\n", + "Call the ```.create()``` method of the Feature store instance to create a feature store." ] }, { "cell_type": "code", - "execution_count": 8, - "id": "b490664b", + "execution_count": null, + "id": "6686061a", "metadata": { "pycharm": { "is_executing": true, @@ -791,52 +432,17 @@ ")" ] }, - { - "cell_type": "markdown", - "id": "b9bb4ef6", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "\n", - "##### Create Feature Store\n", - "\n", - "Call the ```.create()``` method of the Feature store instance to create a feature store." 
- ] - }, { "cell_type": "code", - "execution_count": 9, - "id": "b70ade05", + "execution_count": null, + "id": "507427bb", "metadata": { "pycharm": { "is_executing": true, "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: featurestore\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " description: Data consisting of flights\n", - " displayName: flights details\n", - " id: 751D665EB6AE7360928F15705F9F0F48\n", - " offlineConfig:\n", - " metastoreId: ocid1.datacatalogmetastore.oc1.iad.amaaaaaanif7xwiaavhd2liaebamr3tbjzio3uw2lxuteoa5ejsfvhqufbsa\n", - "type: featureStore" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_store = feature_store_resource.create()\n", "feature_store" @@ -844,7 +450,7 @@ }, { "cell_type": "markdown", - "id": "4e2bc9f0", + "id": "06ff51d1", "metadata": { "pycharm": { "name": "#%% md\n" @@ -857,33 +463,14 @@ }, { "cell_type": "code", - "execution_count": 10, - "id": "a75bf559", + "execution_count": null, + "id": "fb1178da", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: entity\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " description: description for flight details\n", - " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", - " id: 843E320A28F319748425787F04BCD3B8\n", - " name: Flight details2\n", - "type: entity" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "entity = feature_store.create_entity(\n", " display_name=\"Flight details2\",\n", @@ -894,7 +481,7 @@ }, { "cell_type": "markdown", - "id": "6e4a0991", + "id": "8415e7ba", "metadata": { "pycharm": { "name": "#%% md\n" @@ -907,7 +494,7 @@ }, { "cell_type": "markdown", - "id": 
"b59e6d7d", + "id": "a1de5443", "metadata": { "pycharm": { "name": "#%% md\n" @@ -926,36 +513,14 @@ }, { "cell_type": "code", - "execution_count": 11, - "id": "9e0665c2", + "execution_count": null, + "id": "d1e7b81d", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting default log level to \"WARN\".\n", - "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", - "2023/07/14 04:29:29 NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_flights = (\n", " FeatureGroup()\n", @@ -970,8 +535,8 @@ }, { "cell_type": "code", - "execution_count": 12, - "id": "753119fc", + "execution_count": null, + "id": "1e1dd87e", "metadata": { "collapsed": false, "jupyter": { @@ -981,583 +546,42 @@ "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " entityId: 843E320A28F319748425787F04BCD3B8\n", - " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", - " id: C24E858807F4EBA22BF14C08B9A6E2DD\n", - " inputFeatureDetails:\n", - " - featureType: LONG\n", - " name: YEAR\n", - " orderNumber: 1\n", - " - featureType: LONG\n", - " name: MONTH\n", - " orderNumber: 2\n", - " - featureType: LONG\n", - " name: DAY\n", - " orderNumber: 3\n", - " - featureType: LONG\n", - " name: DAY_OF_WEEK\n", - " orderNumber: 4\n", - " - featureType: STRING\n", - " name: AIRLINE\n", - " orderNumber: 5\n", - " - featureType: LONG\n", - " name: FLIGHT_NUMBER\n", - " orderNumber: 6\n", - " - featureType: STRING\n", - " name: ORIGIN_AIRPORT\n", - " orderNumber: 7\n", - " - featureType: STRING\n", - " name: DESTINATION_AIRPORT\n", - " orderNumber: 8\n", - " isInferSchema: true\n", - " name: flights_feature_group\n", - " primaryKeys:\n", - " items:\n", - " - name: FLIGHT_NUMBER\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: featureGroup" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_flights.create()" ] }, { "cell_type": "code", - "execution_count": 13, - "id": "7c7b8e9b", + "execution_count": null, + "id": "f41999bb", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": 
{ - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "%3\n", - "\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "flights details\n", - "Feature Store\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "Flight details2\n", - "Entity\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "\n", - "\n", - "C24E858807F4EBA22BF14C08B9A6E2DD\n", - "\n", - "flights_feature_group\n", - "Feature Group\n", - "C24E858807F4EBA22BF14C08B9A6E2DD\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8->C24E858807F4EBA22BF14C08B9A6E2DD\n", - "\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "feature_group_flights.show()" ] }, { "cell_type": "code", - "execution_count": 14, - "id": "8d28daf4", + "execution_count": null, + "id": "6f22a65c", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Hive Session ID = 59994193-ab1d-4749-8d21-17cc661a95c6\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.builder:validating required components\n", - "INFO:mlm_insights.builder:required components validated\n", - "INFO:mlm_insights.builder.usage:Activating Minimal Insights Usage\n", - "INFO:mlm_insights.builder:Generating Runner object\n", - "INFO:mlm_insights.builder:Generating workflow request\n", - "INFO:mlm_insights.workflow:Fetching engine object\n", - "INFO:mlm_insights.workflow:Returning native engine object\n", - "INFO:mlm_insights.builder:Running Fugue Workflow\n", - "INFO:mlm_insights.workflow:Executing Fugue Workflow\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. 
This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. 
Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "[Stage 8:=============================> (1 + 1) / 2]\r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9399bf0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=2015.0, minimum=2015.0, maximum=2015.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9399d70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9399cf0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9203930>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9399e70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9399db0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': 
FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef939a230>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef939a570>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef939a470>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef939a970>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=4.0, minimum=4.0, maximum=4.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef939abf0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef939aaf0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9399630>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef93a9030>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef93a9530>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1711.5100000000002, minimum=17.0, maximum=7419.0, central_moments=[1.0, 0.0, 3509091.8299000002, 10157914842.877602, 55483811382672.16]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef93a97b0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch 
object at 0x7f8ef93a96b0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef93a9af0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef93a9bb0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef93a9cb0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9416830>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 2015.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [2015.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [2015.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 2015.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 201500.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 2015.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - 
"INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting 
quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input 
data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 4.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [4.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [4.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 4.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 400.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - 
"INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 4.0, 'q2': 4.0, 'q3': 4.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 4.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='AA', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='B6', estimate=12, lower_bound=12, upper_bound=12), FrequentItemEstimate(value='NK', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='UA', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='AS', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='DL', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='US', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='OO', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='EV', estimate=7, lower_bound=7, upper_bound=7), FrequentItemEstimate(value='HA', estimate=5, lower_bound=5, upper_bound=5)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 88, 'percentage': 88.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['AA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 12.000000327825557\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 1.5452988004009884\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 1873.257011170651\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 17.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 1905.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 7402.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 
4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999999, 0.04999999999999999, 0.08999999999999997, 0.07000000000000006, 0.040000000000000036, 0.039999999999999925, 0.040000000000000036, 0.06999999999999995, 0.010000000000000009, 0.010000000000000009, 0.0, 0.0, 0.0, 0.0, 0.010000000000000009, 0.010000000000000009, 0.010000000000000009, 0.0, 0.030000000000000027, 0.039999999999999925, 0.010000000000000009, 0.0, 0.010000000000000009, 0.0, 0.0, 0.0, 0.020000000000000018, 0.010000000000000009]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 3509091.8299000002\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 
4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 7419.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 96.00002264977122 in Distinct count SFC, upper bound = 96.00481585896145, lower bound = 96.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 96.00002264977122\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 171151.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1711.5100000000002\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.5058509315336428\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='ANC', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='LAS', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='SJU', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='LAX', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='SFO', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='PHX', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='SEA', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='HNL', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='ORD', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='PDX', estimate=3, lower_bound=3, upper_bound=3)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 
'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 56, 'percentage': 56.00000000000001}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['ANC']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 44.000004698833415\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='MIA', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='IAH', 
estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='MSP', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='SEA', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='ATL', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DFW', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='MCO', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DEN', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='PHX', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='CLT', estimate=4, lower_bound=4, upper_bound=4)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 71, 'percentage': 71.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['MIA', 'IAH']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc 
from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 29.00000201662398\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 100.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ C24E858807F4EBA22BF14C08B9A6E2DD │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_flights.materialise(flights_df)" ] }, { "cell_type": "markdown", - "id": "41d796d5", + "id": "174992cd", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1572,25 +596,14 @@ }, { "cell_type": "code", - "execution_count": 15, - "id": "4c247dde", + "execution_count": null, + "id": "d2ff01e9", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "{\"meta\": {}, \"expectation_type\": \"expect_column_values_to_be_between\", \"kwargs\": {\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0}}" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "expectation_suite_airports = ExpectationSuite(\n", " expectation_suite_name=\"test_airports_df\"\n", @@ -1618,27 +631,14 @@ }, { "cell_type": "code", - "execution_count": 16, - "id": "81863e53", + "execution_count": null, + "id": 
"0a4e00ee", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airports = (\n", " FeatureGroup()\n", @@ -1657,8 +657,8 @@ }, { "cell_type": "code", - "execution_count": 17, - "id": "e1920d4c", + "execution_count": null, + "id": "f16d798b", "metadata": { "collapsed": false, "jupyter": { @@ -1668,472 +668,42 @@ "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " entityId: 843E320A28F319748425787F04BCD3B8\n", - " expectationDetails:\n", - " createRuleDetails:\n", - " - arguments:\n", - " column: IATA_CODE\n", - " levelType: ERROR\n", - " name: Rule-0\n", - " ruleType: expect_column_values_to_not_be_null\n", - " - arguments:\n", - " column: LATITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-1\n", - " ruleType: expect_column_values_to_be_between\n", - " - arguments:\n", - " column: LONGITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-2\n", - " ruleType: expect_column_values_to_be_between\n", - " expectationType: LENIENT\n", - " name: 
test_airports_df\n", - " validationEngineType: GREAT_EXPECTATIONS\n", - " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", - " id: C1771CFDA79A082BB9FB85D9E5FCB192\n", - " inputFeatureDetails:\n", - " - featureType: STRING\n", - " name: IATA_CODE\n", - " orderNumber: 1\n", - " - featureType: STRING\n", - " name: AIRPORT\n", - " orderNumber: 2\n", - " - featureType: STRING\n", - " name: CITY\n", - " orderNumber: 3\n", - " - featureType: STRING\n", - " name: STATE\n", - " orderNumber: 4\n", - " - featureType: STRING\n", - " name: COUNTRY\n", - " orderNumber: 5\n", - " - featureType: DOUBLE\n", - " name: LATITUDE\n", - " orderNumber: 6\n", - " - featureType: DOUBLE\n", - " name: LONGITUDE\n", - " orderNumber: 7\n", - " isInferSchema: true\n", - " name: airport_feature_group\n", - " primaryKeys:\n", - " items:\n", - " - name: IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: featureGroup" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airports.create()" ] }, { "cell_type": "code", - "execution_count": 18, - "id": "7a78eaa2", + "execution_count": null, + "id": "eab02fe6", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "db29009398704583b95af2e91841296e", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8efbe47830>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9584930>), 
'4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9584270>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9584870>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef958f230>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef958f670>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef958f630>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef958fab0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef958fa70>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9596130>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9596330>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9596230>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9596570>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, 
minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef95967b0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef95966b0>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc 
meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total 
count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc 
meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', 
estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - 
"INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - 
"INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - 
"INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc 
from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 
319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.386920000000003\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, 
-68.65536931034482, -64.79856], 'density': [0.0, 0.003134796238244514, 0.003134796238244514, 0.009404388714733541, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.0031347962382445096, 0.006269592476489033, 0.006269592476489019, 0.00940438871473355, 0.006269592476489033, 0.0, 0.018808777429467072, 0.05642633228840126, 0.040752351097178674, 0.05015673981191224, 0.03448275862068967, 0.043887147335423204, 0.037617554858934144, 0.09090909090909094, 0.08463949843260188, 0.08777429467084641, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.00940438871473348, 0.006269592476489061]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [0, 1, 1, 3, 3, 5, 2, 1, 2, 2, 3, 2, 0, 6, 18, 13, 16, 11, 14, 12, 
29, 27, 28, 32, 30, 26, 18, 9, 3, 2]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -110.94103, 'q2': -93.40307, 'q3': -82.55411}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ C1771CFDA79A082BB9FB85D9E5FCB192 │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airports.materialise(airports_df)" ] }, { "cell_type": "code", - "execution_count": 19, - "id": "44277176", + "execution_count": null, + "id": "c404fd39", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "%3\n", - "\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "flights details\n", - "Feature Store\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "Flight details2\n", - "Entity\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "\n", - "\n", - "C1771CFDA79A082BB9FB85D9E5FCB192\n", - "\n", - "airport_feature_group\n", - "Feature Group\n", - "C1771CFDA79A082BB9FB85D9E5FCB192\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8->C1771CFDA79A082BB9FB85D9E5FCB192\n", - "\n", - "\n", 
- "\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "feature_group_airports.show()" ] }, { "cell_type": "markdown", - "id": "d842551d", + "id": "9d44607e", "metadata": { "pycharm": { "name": "#%% md\n" @@ -2152,7 +722,7 @@ }, { "cell_type": "markdown", - "id": "31a33a56", + "id": "6800691b", "metadata": { "pycharm": { "name": "#%% md\n" @@ -2164,25 +734,14 @@ }, { "cell_type": "code", - "execution_count": 20, - "id": "f3c7a4c2", + "execution_count": null, + "id": "b493fedc", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "{\"meta\": {}, \"expectation_type\": \"expect_column_values_to_not_be_null\", \"kwargs\": {\"column\": \"IATA_CODE\"}}" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "expectation_suite_airlines = ExpectationSuite(\n", " expectation_suite_name=\"test_airlines_df\"\n", @@ -2197,27 +756,14 @@ }, { "cell_type": "code", - "execution_count": 21, - "id": "1b9ad0dc", + "execution_count": null, + "id": "b065942d", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airlines = (\n", " FeatureGroup()\n", @@ -2236,8 +782,8 @@ }, { "cell_type": "code", - "execution_count": 22, - "id": "35cea00f", + "execution_count": null, + "id": "fea7a0fa", "metadata": { "collapsed": false, "jupyter": { @@ -2247,259 +793,42 @@ "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " entityId: 843E320A28F319748425787F04BCD3B8\n", - " expectationDetails:\n", - " createRuleDetails:\n", - " - arguments:\n", - " column: IATA_CODE\n", - " levelType: ERROR\n", - " name: Rule-0\n", - " ruleType: expect_column_values_to_not_be_null\n", - " expectationType: STRICT\n", - " name: test_airlines_df\n", - " validationEngineType: GREAT_EXPECTATIONS\n", - " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", - " id: 4E21D2D878A101E8804837CAD6499FD9\n", - " inputFeatureDetails:\n", - " - featureType: STRING\n", - " name: IATA_CODE\n", - " orderNumber: 1\n", - " - featureType: STRING\n", - " name: AIRLINE\n", - " orderNumber: 2\n", - " isInferSchema: true\n", - " name: airlines_feature_group\n", - " primaryKeys:\n", - " items:\n", - " - name: IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: featureGroup" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airlines.create()" ] }, { "cell_type": "code", - "execution_count": 23, - "id": "ae7c7ff9", + "execution_count": null, + "id": "00c8f7bc", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t1 expectation(s) included in 
expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "971d31d06c77444eadef7392d6903b71", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/6 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef956b430>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef95cdc30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef95cdb70>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 14.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='UA', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='AA', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='NK', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='VX', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='OO', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='WN', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='US', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='DL', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='AS', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='B6', 
estimate=1, lower_bound=1, upper_bound=1)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, upper bound = 14.000699461533127, lower bound = 14.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['UA', 'AA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, upper bound = 14.000699461533127, lower bound = 14.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 14\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 14.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='Skywest Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='American Eagle Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Frontier Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Atlantic Southeast Airlines', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Southwest Airlines Co.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Hawaiian Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='American Airlines Inc.', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Virgin America', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='Spirit Air Lines', estimate=1, lower_bound=1, upper_bound=1), FrequentItemEstimate(value='JetBlue Airways', estimate=1, lower_bound=1, upper_bound=1)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, upper bound = 14.000699461533127, lower bound = 14.0\n", - 
"INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['Skywest Airlines Inc.', 'American Eagle Airlines Inc.']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 14.000000452001906 in Distinct count SFC, upper bound = 14.000699461533127, lower bound = 14.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 14\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 14.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ 4E21D2D878A101E8804837CAD6499FD9 │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airlines.materialise(airlines_df)" ] }, { "cell_type": "code", - "execution_count": 24, - "id": "1c4dcf81", + "execution_count": null, + "id": "45b463d9", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "%3\n", - 
"\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "flights details\n", - "Feature Store\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "Flight details2\n", - "Entity\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "\n", - "\n", - "4E21D2D878A101E8804837CAD6499FD9\n", - "\n", - "airlines_feature_group\n", - "Feature Group\n", - "4E21D2D878A101E8804837CAD6499FD9\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8->4E21D2D878A101E8804837CAD6499FD9\n", - "\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "feature_group_airlines.show()" ] }, { "cell_type": "markdown", - "id": "cb4e05d6", + "id": "e33b817c", "metadata": { "pycharm": { "name": "#%% md\n" @@ -2512,1140 +841,178 @@ }, { "cell_type": "code", - "execution_count": 25, - "id": "a00444ad", + "execution_count": null, + "id": "8228ed24", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
nametypefeature_group_id
0YEARLONGC24E858807F4EBA22BF14C08B9A6E2DD
1MONTHLONGC24E858807F4EBA22BF14C08B9A6E2DD
2DAYLONGC24E858807F4EBA22BF14C08B9A6E2DD
3DAY_OF_WEEKLONGC24E858807F4EBA22BF14C08B9A6E2DD
4AIRLINESTRINGC24E858807F4EBA22BF14C08B9A6E2DD
5FLIGHT_NUMBERLONGC24E858807F4EBA22BF14C08B9A6E2DD
6ORIGIN_AIRPORTSTRINGC24E858807F4EBA22BF14C08B9A6E2DD
7DESTINATION_AIRPORTSTRINGC24E858807F4EBA22BF14C08B9A6E2DD
\n", - "
" - ], - "text/plain": [ - " name type feature_group_id\n", - "0 YEAR LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", - "1 MONTH LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", - "2 DAY LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", - "3 DAY_OF_WEEK LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", - "4 AIRLINE STRING C24E858807F4EBA22BF14C08B9A6E2DD\n", - "5 FLIGHT_NUMBER LONG C24E858807F4EBA22BF14C08B9A6E2DD\n", - "6 ORIGIN_AIRPORT STRING C24E858807F4EBA22BF14C08B9A6E2DD\n", - "7 DESTINATION_AIRPORT STRING C24E858807F4EBA22BF14C08B9A6E2DD" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_flights.get_features_df()" ] }, { "cell_type": "code", - "execution_count": 26, - "id": "1e492391", + "execution_count": null, + "id": "fcf3b866", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
nametypefeature_group_id
0IATA_CODESTRINGC1771CFDA79A082BB9FB85D9E5FCB192
1AIRPORTSTRINGC1771CFDA79A082BB9FB85D9E5FCB192
2CITYSTRINGC1771CFDA79A082BB9FB85D9E5FCB192
3STATESTRINGC1771CFDA79A082BB9FB85D9E5FCB192
4COUNTRYSTRINGC1771CFDA79A082BB9FB85D9E5FCB192
5LATITUDEDOUBLEC1771CFDA79A082BB9FB85D9E5FCB192
6LONGITUDEDOUBLEC1771CFDA79A082BB9FB85D9E5FCB192
\n", - "
" - ], - "text/plain": [ - " name type feature_group_id\n", - "0 IATA_CODE STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", - "1 AIRPORT STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", - "2 CITY STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", - "3 STATE STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", - "4 COUNTRY STRING C1771CFDA79A082BB9FB85D9E5FCB192\n", - "5 LATITUDE DOUBLE C1771CFDA79A082BB9FB85D9E5FCB192\n", - "6 LONGITUDE DOUBLE C1771CFDA79A082BB9FB85D9E5FCB192" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airports.get_features_df()" ] }, { "cell_type": "code", - "execution_count": 27, - "id": "dbde287a", + "execution_count": null, + "id": "a730f3f1", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
nametypefeature_group_id
0IATA_CODESTRING4E21D2D878A101E8804837CAD6499FD9
1AIRLINESTRING4E21D2D878A101E8804837CAD6499FD9
\n", - "
" - ], - "text/plain": [ - " name type feature_group_id\n", - "0 IATA_CODE STRING 4E21D2D878A101E8804837CAD6499FD9\n", - "1 AIRLINE STRING 4E21D2D878A101E8804837CAD6499FD9" - ] - }, - "execution_count": 27, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airlines.get_features_df()" ] }, + { + "cell_type": "markdown", + "id": "a89ddd8f", + "metadata": {}, + "source": [ + "You can retrieve feature data as a DataFrame that can be used to train models." + ] + }, { "cell_type": "code", - "execution_count": 28, - "id": "9c15fb2e", + "execution_count": null, + "id": "a5ccdb47", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[Stage 35:> (0 + 1) / 1]\r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+\n", - "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+\n", - "|2015| 1| 1| 4| B6| 1030| BQN| MCO|\n", - "|2015| 1| 1| 4| B6| 262| SJU| BOS|\n", - "|2015| 1| 1| 4| B6| 2134| SJU| MCO|\n", - "|2015| 1| 1| 4| B6| 730| BQN| MCO|\n", - "|2015| 1| 1| 4| B6| 768| PSE| MCO|\n", - "|2015| 1| 1| 4| B6| 2276| SJU| BDL|\n", - "|2015| 1| 1| 4| US| 602| ORD| PHX|\n", - "|2015| 1| 1| 4| AS| 695| GEG| SEA|\n", - "|2015| 1| 1| 4| HA| 102| HNL| ITO|\n", - "|2015| 1| 1| 4| OO| 5467| ONT| SFO|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+\n", - "only showing top 10 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "outputs": [], "source": [ "feature_group_flights.select().show()" ] }, { "cell_type": "code", - "execution_count": 29, - "id": "1fa80478", + "execution_count": null, + "id": "9e4151b1", 
"metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[Stage 38:> (0 + 1) / 1]\r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+--------------------+-------------+-----+-------+--------+----------+\n", - "|IATA_CODE| AIRPORT| CITY|STATE|COUNTRY|LATITUDE| LONGITUDE|\n", - "+---------+--------------------+-------------+-----+-------+--------+----------+\n", - "| ABE|Lehigh Valley Int...| Allentown| PA| USA|40.65236| -75.4404|\n", - "| ABI|Abilene Regional ...| Abilene| TX| USA|32.41132| -99.6819|\n", - "| ABQ|Albuquerque Inter...| Albuquerque| NM| USA|35.04022|-106.60919|\n", - "| ABR|Aberdeen Regional...| Aberdeen| SD| USA|45.44906| -98.42183|\n", - "| ABY|Southwest Georgia...| Albany| GA| USA|31.53552| -84.19447|\n", - "| ACK|Nantucket Memoria...| Nantucket| MA| USA|41.25305| -70.06018|\n", - "| ACT|Waco Regional Air...| Waco| TX| USA|31.61129| -97.23052|\n", - "| ACV| Arcata Airport|Arcata/Eureka| CA| USA|40.97812|-124.10862|\n", - "| ACY|Atlantic City Int...|Atlantic City| NJ| USA|39.45758| -74.57717|\n", - "| ADK| Adak Airport| Adak| AK| USA|51.87796|-176.64603|\n", - "+---------+--------------------+-------------+-----+-------+--------+----------+\n", - "only showing top 10 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "outputs": [], "source": [ "feature_group_airports.select().show()" ] }, { "cell_type": "code", - "execution_count": 30, - "id": "dbb37e5c", + "execution_count": null, + "id": "18dc0c4f", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+--------------------+\n", - "|IATA_CODE| AIRLINE|\n", - "+---------+--------------------+\n", - "| NK| Spirit Air Lines|\n", - "| WN|Southwest Airline...|\n", - "| 
DL|Delta Air Lines Inc.|\n", - "| EV|Atlantic Southeas...|\n", - "| HA|Hawaiian Airlines...|\n", - "| MQ|American Eagle Ai...|\n", - "| VX| Virgin America|\n", - "| UA|United Air Lines ...|\n", - "| AA|American Airlines...|\n", - "| US| US Airways Inc.|\n", - "+---------+--------------------+\n", - "only showing top 10 rows\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airlines.select().show()" ] }, + { + "cell_type": "markdown", + "id": "5dfab426", + "metadata": {}, + "source": [ + "You can call the `get_statistics()` method of the feature group to fetch statistics for a specific ingestion job. You can use `to_pandas()` or `to_json()` to view the statistics.\n", + "You can visualize feature statistics with `to_viz()`." + ] + }, { "cell_type": "code", - "execution_count": 31, - "id": "e67ea0f5", + "execution_count": null, + "id": "cffeb756", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IATA_CODEAIRLINE
Count{'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0}
TopKFrequentElements{'value': [{'value': 'UA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'NK', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'VX', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'OO', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'WN', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'US', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'DL', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AS', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'B6', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]}{'value': [{'value': 'Skywest Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Eagle Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Frontier Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Atlantic Southeast Airlines', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Southwest Airlines Co.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Hawaiian Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Virgin America', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Spirit Air Lines', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'JetBlue Airways', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]}
TypeMetric{'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}
DuplicateCount{'count': 0, 'percentage': 0.0}{'count': 0, 'percentage': 0.0}
Mode{'value': ['UA', 'AA']}{'value': ['Skywest Airlines Inc.', 'American Eagle Airlines Inc.']}
DistinctCount{'value': 14}{'value': 14}
\n", - "
" - ], - "text/plain": [ - " IATA_CODE \\\n", - "Count {'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "TopKFrequentElements {'value': [{'value': 'UA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AA', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'NK', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'VX', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'OO', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'WN', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'US', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'DL', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'AS', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'B6', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]} \n", - "TypeMetric {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "DuplicateCount {'count': 0, 'percentage': 0.0} \n", - "Mode {'value': ['UA', 'AA']} \n", - "DistinctCount {'value': 14} \n", - "\n", - " AIRLINE \n", - "Count {'total_count': 14, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "TopKFrequentElements {'value': [{'value': 'Skywest Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Eagle Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Frontier Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Atlantic Southeast Airlines', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Southwest Airlines Co.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Hawaiian Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'American Airlines Inc.', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Virgin America', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'Spirit Air Lines', 
'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}, {'value': 'JetBlue Airways', 'estimate': 1, 'lower_bound': 1, 'upper_bound': 1}]} \n", - "TypeMetric {'string_type_count': 14, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "DuplicateCount {'count': 0, 'percentage': 0.0} \n", - "Mode {'value': ['Skywest Airlines Inc.', 'American Eagle Airlines Inc.']} \n", - "DistinctCount {'value': 14} " - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airlines.get_statistics().to_pandas()" ] }, { "cell_type": "code", - "execution_count": 32, - "id": "583db211", + "execution_count": null, + "id": "b1fdd7f6", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
YEARMONTHDAYDAY_OF_WEEKAIRLINEFLIGHT_NUMBERORIGIN_AIRPORTDESTINATION_AIRPORT
Skewness{'value': None}{'value': None}{'value': None}{'value': None}NaN{'value': 1.545298800400988}NaNNaN
StandardDeviation{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 1873.257011170651}NaNNaN
Min{'value': 2015.0}{'value': 1.0}{'value': 1.0}{'value': 4.0}NaN{'value': 17.0}NaNNaN
IsConstantFeature{'value': True}{'value': True}{'value': True}{'value': True}NaN{'value': False}NaNNaN
IQR{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 1905.0}NaNNaN
Range{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 7402.0}NaNNaN
ProbabilityDistribution{'bins': [2015.0], 'density': [1.0]}{'bins': [1.0], 'density': [1.0]}{'bins': [1.0], 'density': [1.0]}{'bins': [4.0], 'density': [1.0]}NaN{'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999902, 0.049999999999999004, 0.08999999999999901, 0.07, 0.04, 0.039999999999999, 0.04, 0.06999999999999901, 0.01, 0.01, 0.0, 0.0, 0.0, 0.0, 0.01, 0.01, 0.01, 0.0, 0.030000000000000002, 0.039999999999999, 0.01, 0.0, 0.01, 0.0, 0.0, 0.0, 0.02, 0.01]}NaNNaN
Variance{'value': 0.0}{'value': 0.0}{'value': 0.0}{'value': 0.0}NaN{'value': 3509091.8299000002}NaNNaN
TypeMetric{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}
FrequencyDistribution{'bins': [2015.0], 'frequency': [100]}{'bins': [1.0], 'frequency': [100]}{'bins': [1.0], 'frequency': [100]}{'bins': [4.0], 'frequency': [100]}NaN{'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]}NaNNaN
Count{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}{'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0}
Max{'value': 2015.0}{'value': 1.0}{'value': 1.0}{'value': 4.0}NaN{'value': 7419.0}NaNNaN
DistinctCount{'value': 1}{'value': 1}{'value': 1}{'value': 1}{'value': 12}{'value': 96}{'value': 44}{'value': 29}
Sum{'value': 201500.0}{'value': 100.0}{'value': 100.0}{'value': 400.0}NaN{'value': 171151.0}NaNNaN
IsQuasiConstantFeature{'value': True}{'value': True}{'value': True}{'value': True}NaN{'value': False}NaNNaN
Quartiles{'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0}{'q1': 1.0, 'q2': 1.0, 'q3': 1.0}{'q1': 1.0, 'q2': 1.0, 'q3': 1.0}{'q1': 4.0, 'q2': 4.0, 'q3': 4.0}NaN{'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0}NaNNaN
Mean{'value': 2015.0}{'value': 1.0}{'value': 1.0}{'value': 4.0}NaN{'value': 1711.5100000000002}NaNNaN
Kurtosis{'value': None}{'value': None}{'value': None}{'value': None}NaN{'value': 1.505850931533642}NaNNaN
TopKFrequentElementsNaNNaNNaNNaN{'value': [{'value': 'AA', 'estimate': 14, 'lower_bound': 14, 'upper_bound': 14}, {'value': 'B6', 'estimate': 12, 'lower_bound': 12, 'upper_bound': 12}, {'value': 'NK', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'UA', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'AS', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'DL', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'US', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'OO', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'EV', 'estimate': 7, 'lower_bound': 7, 'upper_bound': 7}, {'value': 'HA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}]}NaN{'value': [{'value': 'ANC', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'LAS', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SJU', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'LAX', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'SFO', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'PHX', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'SEA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'HNL', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'ORD', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'PDX', 'estimate': 3, 'lower_bound': 3, 'upper_bound': 3}]}{'value': [{'value': 'MIA', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'IAH', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'MSP', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SEA', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'ATL', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DFW', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'MCO', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DEN', 'estimate': 5, 'lower_bound': 5, 
'upper_bound': 5}, {'value': 'PHX', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'CLT', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}]}
DuplicateCountNaNNaNNaNNaN{'count': 88, 'percentage': 88.0}NaN{'count': 56, 'percentage': 56.00000000000001}{'count': 71, 'percentage': 71.0}
ModeNaNNaNNaNNaN{'value': ['AA']}NaN{'value': ['ANC']}{'value': ['MIA', 'IAH']}
\n", - "
" - ], - "text/plain": [ - " YEAR \\\n", - "Skewness {'value': None} \n", - "StandardDeviation {'value': 0.0} \n", - "Min {'value': 2015.0} \n", - "IsConstantFeature {'value': True} \n", - "IQR {'value': 0.0} \n", - "Range {'value': 0.0} \n", - "ProbabilityDistribution {'bins': [2015.0], 'density': [1.0]} \n", - "Variance {'value': 0.0} \n", - "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution {'bins': [2015.0], 'frequency': [100]} \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max {'value': 2015.0} \n", - "DistinctCount {'value': 1} \n", - "Sum {'value': 201500.0} \n", - "IsQuasiConstantFeature {'value': True} \n", - "Quartiles {'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0} \n", - "Mean {'value': 2015.0} \n", - "Kurtosis {'value': None} \n", - "TopKFrequentElements NaN \n", - "DuplicateCount NaN \n", - "Mode NaN \n", - "\n", - " MONTH \\\n", - "Skewness {'value': None} \n", - "StandardDeviation {'value': 0.0} \n", - "Min {'value': 1.0} \n", - "IsConstantFeature {'value': True} \n", - "IQR {'value': 0.0} \n", - "Range {'value': 0.0} \n", - "ProbabilityDistribution {'bins': [1.0], 'density': [1.0]} \n", - "Variance {'value': 0.0} \n", - "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution {'bins': [1.0], 'frequency': [100]} \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max {'value': 1.0} \n", - "DistinctCount {'value': 1} \n", - "Sum {'value': 100.0} \n", - "IsQuasiConstantFeature {'value': True} \n", - "Quartiles {'q1': 1.0, 'q2': 1.0, 'q3': 1.0} \n", - "Mean {'value': 1.0} \n", - "Kurtosis {'value': None} \n", - "TopKFrequentElements NaN \n", - "DuplicateCount NaN \n", - "Mode NaN \n", - "\n", - " DAY \\\n", - "Skewness {'value': None} \n", - "StandardDeviation {'value': 0.0} \n", - 
"Min {'value': 1.0} \n", - "IsConstantFeature {'value': True} \n", - "IQR {'value': 0.0} \n", - "Range {'value': 0.0} \n", - "ProbabilityDistribution {'bins': [1.0], 'density': [1.0]} \n", - "Variance {'value': 0.0} \n", - "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution {'bins': [1.0], 'frequency': [100]} \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max {'value': 1.0} \n", - "DistinctCount {'value': 1} \n", - "Sum {'value': 100.0} \n", - "IsQuasiConstantFeature {'value': True} \n", - "Quartiles {'q1': 1.0, 'q2': 1.0, 'q3': 1.0} \n", - "Mean {'value': 1.0} \n", - "Kurtosis {'value': None} \n", - "TopKFrequentElements NaN \n", - "DuplicateCount NaN \n", - "Mode NaN \n", - "\n", - " DAY_OF_WEEK \\\n", - "Skewness {'value': None} \n", - "StandardDeviation {'value': 0.0} \n", - "Min {'value': 4.0} \n", - "IsConstantFeature {'value': True} \n", - "IQR {'value': 0.0} \n", - "Range {'value': 0.0} \n", - "ProbabilityDistribution {'bins': [4.0], 'density': [1.0]} \n", - "Variance {'value': 0.0} \n", - "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution {'bins': [4.0], 'frequency': [100]} \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max {'value': 4.0} \n", - "DistinctCount {'value': 1} \n", - "Sum {'value': 400.0} \n", - "IsQuasiConstantFeature {'value': True} \n", - "Quartiles {'q1': 4.0, 'q2': 4.0, 'q3': 4.0} \n", - "Mean {'value': 4.0} \n", - "Kurtosis {'value': None} \n", - "TopKFrequentElements NaN \n", - "DuplicateCount NaN \n", - "Mode NaN \n", - "\n", - " AIRLINE \\\n", - "Skewness NaN \n", - "StandardDeviation NaN \n", - "Min NaN \n", - "IsConstantFeature NaN \n", - "IQR NaN \n", - "Range NaN \n", - "ProbabilityDistribution NaN \n", - "Variance NaN \n", - "TypeMetric 
{'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution NaN \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max NaN \n", - "DistinctCount {'value': 12} \n", - "Sum NaN \n", - "IsQuasiConstantFeature NaN \n", - "Quartiles NaN \n", - "Mean NaN \n", - "Kurtosis NaN \n", - "TopKFrequentElements {'value': [{'value': 'AA', 'estimate': 14, 'lower_bound': 14, 'upper_bound': 14}, {'value': 'B6', 'estimate': 12, 'lower_bound': 12, 'upper_bound': 12}, {'value': 'NK', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'UA', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'AS', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'DL', 'estimate': 11, 'lower_bound': 11, 'upper_bound': 11}, {'value': 'US', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'OO', 'estimate': 8, 'lower_bound': 8, 'upper_bound': 8}, {'value': 'EV', 'estimate': 7, 'lower_bound': 7, 'upper_bound': 7}, {'value': 'HA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}]} \n", - "DuplicateCount {'count': 88, 'percentage': 88.0} \n", - "Mode {'value': ['AA']} \n", - "\n", - " FLIGHT_NUMBER \\\n", - "Skewness {'value': 1.545298800400988} \n", - "StandardDeviation {'value': 1873.257011170651} \n", - "Min {'value': 17.0} \n", - "IsConstantFeature {'value': False} \n", - "IQR {'value': 1905.0} \n", - "Range {'value': 7402.0} \n", - "ProbabilityDistribution {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 
6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999902, 0.049999999999999004, 0.08999999999999901, 0.07, 0.04, 0.039999999999999, 0.04, 0.06999999999999901, 0.01, 0.01, 0.0, 0.0, 0.0, 0.0, 0.01, 0.01, 0.01, 0.0, 0.030000000000000002, 0.039999999999999, 0.01, 0.0, 0.01, 0.0, 0.0, 0.0, 0.02, 0.01]} \n", - "Variance {'value': 3509091.8299000002} \n", - "TypeMetric {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]} \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max {'value': 7419.0} \n", - "DistinctCount {'value': 96} \n", - "Sum {'value': 171151.0} \n", - "IsQuasiConstantFeature {'value': False} \n", - "Quartiles {'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0} \n", - "Mean {'value': 1711.5100000000002} \n", - "Kurtosis {'value': 1.505850931533642} \n", - "TopKFrequentElements NaN \n", - "DuplicateCount NaN \n", - "Mode NaN \n", - "\n", - " ORIGIN_AIRPORT \\\n", - "Skewness NaN \n", - "StandardDeviation NaN \n", - "Min NaN \n", - "IsConstantFeature NaN \n", - "IQR NaN \n", - "Range NaN \n", - "ProbabilityDistribution NaN \n", - "Variance NaN \n", - "TypeMetric {'string_type_count': 100, 
'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution NaN \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max NaN \n", - "DistinctCount {'value': 44} \n", - "Sum NaN \n", - "IsQuasiConstantFeature NaN \n", - "Quartiles NaN \n", - "Mean NaN \n", - "Kurtosis NaN \n", - "TopKFrequentElements {'value': [{'value': 'ANC', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'LAS', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SJU', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'LAX', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'SFO', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'PHX', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'SEA', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'HNL', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'ORD', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'PDX', 'estimate': 3, 'lower_bound': 3, 'upper_bound': 3}]} \n", - "DuplicateCount {'count': 56, 'percentage': 56.00000000000001} \n", - "Mode {'value': ['ANC']} \n", - "\n", - " DESTINATION_AIRPORT \n", - "Skewness NaN \n", - "StandardDeviation NaN \n", - "Min NaN \n", - "IsConstantFeature NaN \n", - "IQR NaN \n", - "Range NaN \n", - "ProbabilityDistribution NaN \n", - "Variance NaN \n", - "TypeMetric {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0} \n", - "FrequencyDistribution NaN \n", - "Count {'total_count': 100, 'missing_count': 0, 'missing_count_percentage': 0.0} \n", - "Max NaN \n", - "DistinctCount {'value': 29} \n", - "Sum NaN \n", - "IsQuasiConstantFeature NaN \n", - "Quartiles NaN \n", - "Mean NaN \n", - "Kurtosis NaN \n", - "TopKFrequentElements {'value': [{'value': 'MIA', 'estimate': 10, 'lower_bound': 10, 'upper_bound': 10}, {'value': 'IAH', 'estimate': 10, 'lower_bound': 10, 
'upper_bound': 10}, {'value': 'MSP', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'SEA', 'estimate': 9, 'lower_bound': 9, 'upper_bound': 9}, {'value': 'ATL', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DFW', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'MCO', 'estimate': 6, 'lower_bound': 6, 'upper_bound': 6}, {'value': 'DEN', 'estimate': 5, 'lower_bound': 5, 'upper_bound': 5}, {'value': 'PHX', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}, {'value': 'CLT', 'estimate': 4, 'lower_bound': 4, 'upper_bound': 4}]} \n", - "DuplicateCount {'count': 71, 'percentage': 71.0} \n", - "Mode {'value': ['MIA', 'IAH']} " - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_flights.get_statistics().to_pandas()" ] }, { "cell_type": "code", - "execution_count": 33, - "id": "7cfc56fe", + "execution_count": null, + "id": "64cc1014", + "metadata": {}, + "outputs": [], + "source": [ + "feature_group_airlines.get_statistics().to_viz()" + ] + }, + { + "cell_type": "markdown", + "id": "6cb585d5", + "metadata": {}, + "source": [ + "You can call the `get_validation_output()` method of the FeatureGroup instance to fetch validation results for a specific ingestion job." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a13fc434", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
\n", - "
" - ], - "text/plain": [ - " 0\n", - "results [{'exception_info': {'raised_exception': False, 'exception_traceback': None, 'exception_message': None}, 'meta': {}, 'result': {'element_count': 14, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}, 'expectation_config': {'meta': {}, 'expectation_type': 'expect_column_values_to_not_be_null', 'kwargs': {'column': 'IATA_CODE', 'batch_id': '90bbaf1a6a4ae45a238e05e0d240a033'}}, 'success': True}]\n", - "success True\n", - "meta.great_expectations_version 0.15.39\n", - "meta.expectation_suite_name airlines_feature_group\n", - "meta.run_id.run_time 2023-07-14T04:30:58.945832+00:00\n", - "meta.run_id.run_name None\n", - "meta.batch_markers.ge_load_time 20230714T043058.944828Z\n", - "meta.active_batch_definition.datasource_name feature-ingestion-pipeline\n", - "meta.active_batch_definition.data_connector_name feature-ingestion-pipeline\n", - "meta.active_batch_definition.data_asset_name feature-ingestion-pipeline\n", - "meta.active_batch_definition.batch_identifiers.ge_batch_id 3b3f551a-21ff-11ee-9023-0242ac130002\n", - "meta.validation_time 20230714T043058.945751Z\n", - "meta.checkpoint_name None\n", - "statistics.evaluated_expectations 1\n", - "statistics.successful_expectations 1\n", - "statistics.unsuccessful_expectations 0\n", - "statistics.success_percent 100.0" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airlines.get_validation_output().to_pandas()" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "c219e3f9", + "metadata": {}, + "outputs": [], + "source": [ + "feature_group_airlines.get_validation_output().to_summary()" + ] + }, { "cell_type": "markdown", - "id": "a84c1e68", + "id": "e301ded3", "metadata": { "pycharm": { "name": "#%% md\n" @@ -3658,51 +1025,21 @@ }, { "cell_type": "code", - "execution_count": 34, - "id": "d81ddcbb", + "execution_count": null, + "id": "66194d26", 
"metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+\n", - "|IATA_CODE|\n", - "+---------+\n", - "| NK|\n", - "| WN|\n", - "| DL|\n", - "| EV|\n", - "| HA|\n", - "| MQ|\n", - "| VX|\n", - "| UA|\n", - "| AA|\n", - "| US|\n", - "+---------+\n", - "only showing top 10 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "outputs": [], "source": [ "feature_group_airlines.select(['IATA_CODE']).show()" ] }, { "cell_type": "markdown", - "id": "2416e2a2", + "id": "dd80ceb0", "metadata": { "pycharm": { "name": "#%% md\n" @@ -3715,48 +1052,21 @@ }, { "cell_type": "code", - "execution_count": 35, - "id": "19267e79", + "execution_count": null, + "id": "aa4cc044", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[Stage 50:> (0 + 1) / 1]\r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+--------------------+\n", - "|IATA_CODE| AIRLINE|\n", - "+---------+--------------------+\n", - "| EV|Atlantic Southeas...|\n", - "+---------+--------------------+\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "outputs": [], "source": [ "feature_group_airlines.filter(feature_group_airlines.IATA_CODE == \"EV\").show()" ] }, { "cell_type": "markdown", - "id": "22da4132", + "id": "f885a179", "metadata": { "pycharm": { "name": "#%% md\n" @@ -3770,46 +1080,14 @@ }, { "cell_type": "code", - "execution_count": 36, - "id": "212c1750", + "execution_count": null, + "id": "526997d3", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - 
"+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", - "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|IATA_CODE|AIRPORT|CITY|STATE|COUNTRY|LATITUDE|LONGITUDE|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", - "|2015| 1| 1| 4| B6| 1030| BQN| MCO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 262| SJU| BOS| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 2134| SJU| MCO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 730| BQN| MCO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 768| PSE| MCO| null| null|null| null| null| null| null|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", - "only showing top 5 rows\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "outputs": [], "source": [ "from ads.feature_store.common.enums import JoinType\n", "\n", @@ -3823,32 +1101,21 @@ }, { "cell_type": "code", - "execution_count": 37, - "id": "7690e2a4", + "execution_count": null, + "id": "22dbfa74", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "'SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK, fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT, fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE LATITUDE, fg_1.LONGITUDE LONGITUDE FROM `843E320A28F319748425787F04BCD3B8`.flights_feature_group fg_2 LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0 ON 
fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE'" - ] - }, - "execution_count": 37, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "query.to_string()" ] }, { "cell_type": "markdown", - "id": "a249cbe0", + "id": "9b903c93", "metadata": { "pycharm": { "name": "#%% md\n" @@ -3856,7 +1123,7 @@ }, "source": [ "\n", - "### 3.7 Create dataset\n", + "### 3.7 Dataset\n", "A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference.\n", "\n", "
\n", @@ -3866,8 +1133,8 @@ }, { "cell_type": "code", - "execution_count": 38, - "id": "ad857582", + "execution_count": null, + "id": "5f060a15", "metadata": { "pycharm": { "name": "#%%\n" @@ -3888,7 +1155,7 @@ }, { "cell_type": "markdown", - "id": "c61e568b", + "id": "77d3f2a9", "metadata": { "pycharm": { "name": "#%% md\n" @@ -3903,917 +1170,62 @@ }, { "cell_type": "code", - "execution_count": 39, - "id": "ca7becdf", + "execution_count": null, + "id": "3d95cf4c", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: Dataset\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " description: Combined dataset for flights\n", - " entityId: 843E320A28F319748425787F04BCD3B8\n", - " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", - " id: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " name: flights_dataset\n", - " query: SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK,\n", - " fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT,\n", - " fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT\n", - " AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE\n", - " LATITUDE, fg_1.LONGITUDE LONGITUDE FROM `843E320A28F319748425787F04BCD3B8`.flights_feature_group\n", - " fg_2 LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0\n", - " ON fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group\n", - " fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: dataset" - ] - }, - "execution_count": 39, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "dataset.create()" ] }, { "cell_type": "code", - "execution_count": 40, - "id": "597e3dd1", + "execution_count": null, + "id": 
"aaf8c3b4", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.builder:validating required components\n", - "INFO:mlm_insights.builder:required components validated\n", - "INFO:mlm_insights.builder.usage:Activating Minimal Insights Usage\n", - "INFO:mlm_insights.builder:Generating Runner object\n", - "INFO:mlm_insights.builder:Generating workflow request\n", - "INFO:mlm_insights.workflow:Fetching engine object\n", - "INFO:mlm_insights.workflow:Returning native engine object\n", - "INFO:mlm_insights.builder:Running Fugue Workflow\n", - "INFO:mlm_insights.workflow:Executing Fugue Workflow\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. 
Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. 
Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", - " return _methods._mean(a, axis=axis, dtype=dtype,\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", - " ret = ret.dtype.type(ret / rcount)\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", - " return _methods._mean(a, axis=axis, dtype=dtype,\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", - " ret = ret.dtype.type(ret / rcount)\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. 
Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/mlm_insights/core/sfcs/descriptive_statistics_sfc.py:80: RuntimeWarning: Precision loss occurred in moment calculation due to catastrophic cancellation. This occurs when the data are nearly identical. Results may be unreliable.\n", - " self.central_moments = [moment(column, moment=i) for i in range(MAXIMUM_MOMENT_ORDER + 1)]\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", - " return _methods._mean(a, axis=axis, dtype=dtype,\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", - " ret = ret.dtype.type(ret / rcount)\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.\n", - " return _methods._mean(a, axis=axis, dtype=dtype,\n", - "/home/datascience/conda/fspyspark32_p38_cpu#conda_v1/lib/python3.8/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide\n", - " ret = ret.dtype.type(ret / rcount)\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': 
FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c31570>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=2015.0, minimum=2015.0, maximum=2015.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c31270>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9589030>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9590f30>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef95613f0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef95842f0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef95909b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1.0, minimum=1.0, maximum=1.0, central_moments=[1.0, 0.0, 0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9590e30>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9c1f1b0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f330>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=4.0, minimum=4.0, maximum=4.0, central_moments=[1.0, 0.0, 
0.0, 0.0, 0.0]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fc30>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9c1f8f0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f2b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fdb0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f3b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=100.0, mean=1711.51, minimum=17.0, maximum=7419.0, central_moments=[1.0, 0.0, 3509091.8299000002, 10157914842.877602, 55483811382672.16]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1f470>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9c1fd70>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f630>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fb30>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c1f9b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c1fbb0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c23ab0>), 
'4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c23470>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c235f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c43670>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c435f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c43c70>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c43bf0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c43f70>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c3f4b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9c3ff70>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9c3f630>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=0.0, mean=nan, minimum=nan, maximum=nan, central_moments=[nan, nan, nan, nan, nan]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9502870>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9526770>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from 
{'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f8ef9526d70>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=0.0, mean=nan, minimum=nan, maximum=nan, central_moments=[nan, nan, nan, nan, nan]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f8ef9526df0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f8ef9526070>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 2015.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - 
"INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [2015.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [2015.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 2015.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 201500.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 2015.0, 'q2': 2015.0, 'q3': 2015.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 2015.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: 
{'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", 
- "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", 
- "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [1.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [1.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 
1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 100.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 1.0, 'q2': 1.0, 'q3': 1.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 4.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [4.0], 'density': [1.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 
100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [4.0], 'frequency': [100]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 4.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 400.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 4.0, 'q2': 4.0, 'q3': 4.0}\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 4.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='AA', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='B6', estimate=12, lower_bound=12, upper_bound=12), FrequentItemEstimate(value='UA', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='AS', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='NK', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='DL', estimate=11, lower_bound=11, upper_bound=11), FrequentItemEstimate(value='US', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='OO', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='EV', estimate=7, lower_bound=7, upper_bound=7), FrequentItemEstimate(value='HA', estimate=5, lower_bound=5, upper_bound=5)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - 
"INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 88, 'percentage': 88.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['AA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 12.000000327825557 in Distinct count SFC, upper bound = 12.000599478849342, lower bound = 12.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 12.000000327825557\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 1.5452988004009884\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 
1873.257011170651\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 17.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 1905.0\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 7402.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 
5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'density': [0.22, 0.1, 0.10999999999999999, 0.04999999999999999, 0.08999999999999997, 0.07000000000000006, 0.040000000000000036, 0.039999999999999925, 0.040000000000000036, 0.06999999999999995, 0.010000000000000009, 0.010000000000000009, 0.0, 0.0, 0.0, 0.0, 0.010000000000000009, 0.010000000000000009, 0.010000000000000009, 0.0, 0.030000000000000027, 0.039999999999999925, 0.010000000000000009, 0.0, 0.010000000000000009, 0.0, 0.0, 0.0, 0.020000000000000018, 0.010000000000000009]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 3509091.8299000002\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 100, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [17.0, 272.2413793103448, 527.4827586206897, 782.7241379310344, 1037.9655172413793, 1293.2068965517242, 1548.4482758620688, 1803.6896551724137, 2058.9310344827586, 2314.1724137931033, 2569.4137931034484, 2824.655172413793, 3079.8965517241377, 3335.137931034483, 3590.3793103448274, 3845.6206896551726, 4100.862068965517, 4356.103448275862, 4611.3448275862065, 4866.586206896552, 5121.827586206897, 5377.068965517241, 5632.310344827586, 5887.551724137931, 6142.793103448275, 6398.0344827586205, 6653.275862068966, 6908.517241379311, 7163.758620689655, 7419.0], 'frequency': [22, 10, 11, 5, 9, 7, 4, 4, 4, 7, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 3, 4, 1, 0, 1, 0, 0, 0, 2, 1]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, 
value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 7419.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 96.00002264977122 in Distinct count SFC, upper bound = 96.00481585896145, lower bound = 96.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 96.00002264977122\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 171151.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 371.0, 'q2': 1162.0, 'q3': 2276.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 1711.51\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) 
sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.5058509315336428\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='ANC', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='LAS', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='SJU', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='LAX', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='SFO', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='PHX', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='SEA', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='HNL', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='ORD', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='PDX', estimate=3, lower_bound=3, upper_bound=3)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 56, 'percentage': 56.00000000000001}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['ANC']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 44.000004698833415 in Distinct count SFC, upper bound = 44.00220158609522, lower bound = 44.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 44.000004698833415\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='IAH', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='MIA', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='SEA', estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='MSP', 
estimate=9, lower_bound=9, upper_bound=9), FrequentItemEstimate(value='ATL', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DFW', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='MCO', estimate=6, lower_bound=6, upper_bound=6), FrequentItemEstimate(value='DEN', estimate=5, lower_bound=5, upper_bound=5), FrequentItemEstimate(value='CLT', estimate=4, lower_bound=4, upper_bound=4), FrequentItemEstimate(value='PHX', estimate=4, lower_bound=4, upper_bound=4)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 100, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 71, 'percentage': 71.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['IAH', 'MIA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting 
total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 29.00000201662398 in Distinct count SFC, upper bound = 29.00144996499259, lower bound = 29.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 29.00000201662398\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: 
[]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - 
"INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.metrics:Calculated Mean metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.metrics:Calculated Variance metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 100.0, 'missing_count': 100.0, 'missing_count_percentage': 100.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 0.0 in Distinct count SFC, upper bound = 0.0, lower bound = 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 0.0\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 0.0\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, 
config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: None\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 100.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ 6881C3E17FC9BBB02934BB7B6B9068D1 │ DATASET │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "dataset.materialise()" ] }, { "cell_type": "markdown", - "id": "2b775d67", + "id": "d1ea299d", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ - "### Interoperability with model" + "#### Interoperability with model" ] }, { "cell_type": "code", - "execution_count": 41, - "id": "f14c80c4", + "execution_count": null, + "id": "7a3d4e72", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: Dataset\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " description: Combined dataset for flights\n", - " entityId: 843E320A28F319748425787F04BCD3B8\n", - " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", - " id: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " jobId: f8a347ca-db9a-4ba6-adbf-c3a5f0c61441\n", - " modelDetails:\n", - " items:\n", - " - ocid1.modelcatalog.oc1.unique_ocid\n", - " name: flights_dataset\n", - " outputFeatureDetails:\n", - " items:\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", 
- " featureType: LONG\n", - " name: YEAR\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: MONTH\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: DAY\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: DAY_OF_WEEK\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: AIRLINE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: FLIGHT_NUMBER\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: ORIGIN_AIRPORT\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: DESTINATION_AIRPORT\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: IATA_CODE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: AIRPORT\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: CITY\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: STATE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: COUNTRY\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: DOUBLE\n", - " name: LATITUDE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: DOUBLE\n", - " name: LONGITUDE\n", - " query: SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK,\n", - " fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT,\n", - " fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT\n", - " AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE\n", - " LATITUDE, fg_1.LONGITUDE LONGITUDE FROM 
`843E320A28F319748425787F04BCD3B8`.flights_feature_group\n", - " fg_2 LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0\n", - " ON fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group\n", - " fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: dataset" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "model_details = ModelDetails().with_items([\"ocid1.modelcatalog.oc1.unique_ocid\"])\n", - "dataset.with_model_details(model_details)" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "id": "8b5d9b08", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: Dataset\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " description: Combined dataset for flights\n", - " entityId: 843E320A28F319748425787F04BCD3B8\n", - " featureStoreId: 751D665EB6AE7360928F15705F9F0F48\n", - " id: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " jobId: f8a347ca-db9a-4ba6-adbf-c3a5f0c61441\n", - " modelDetails:\n", - " items:\n", - " - ocid1.modelcatalog.oc1.unique_ocid\n", - " name: flights_dataset\n", - " outputFeatureDetails:\n", - " items:\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: YEAR\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: MONTH\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: DAY\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: DAY_OF_WEEK\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: AIRLINE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: LONG\n", - " name: FLIGHT_NUMBER\n", - " 
- datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: ORIGIN_AIRPORT\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: DESTINATION_AIRPORT\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: IATA_CODE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: AIRPORT\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: CITY\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: STATE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: STRING\n", - " name: COUNTRY\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: DOUBLE\n", - " name: LATITUDE\n", - " - datasetId: 6881C3E17FC9BBB02934BB7B6B9068D1\n", - " featureType: DOUBLE\n", - " name: LONGITUDE\n", - " query: SELECT fg_2.YEAR YEAR, fg_2.MONTH MONTH, fg_2.DAY DAY, fg_2.DAY_OF_WEEK DAY_OF_WEEK,\n", - " fg_2.AIRLINE AIRLINE, fg_2.FLIGHT_NUMBER FLIGHT_NUMBER, fg_2.ORIGIN_AIRPORT ORIGIN_AIRPORT,\n", - " fg_2.DESTINATION_AIRPORT DESTINATION_AIRPORT, fg_0.IATA_CODE IATA_CODE, fg_1.AIRPORT\n", - " AIRPORT, fg_1.CITY CITY, fg_1.STATE STATE, fg_1.COUNTRY COUNTRY, fg_1.LATITUDE\n", - " LATITUDE, fg_1.LONGITUDE LONGITUDE FROM `843E320A28F319748425787F04BCD3B8`.flights_feature_group\n", - " fg_2 LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airlines_feature_group fg_0\n", - " ON fg_2.ORIGIN_AIRPORT = fg_0.IATA_CODE LEFT JOIN `843E320A28F319748425787F04BCD3B8`.airport_feature_group\n", - " fg_1 ON fg_0.AIRLINE = fg_1.IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: dataset" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dataset.update()" + "dataset.add_models(model_details)" ] }, { "cell_type": "markdown", - "id": "ba077d02", + "id": "40efc4ab", 
"metadata": { "pycharm": { "name": "#%% md\n" @@ -4821,210 +1233,56 @@ }, "source": [ "\n", - "##### Visualise lineage\n", + "#### Visualise lineage\n", "\n", "Use the ```.show()``` method on the Dataset instance to visualize the lineage of the dataset." ] }, { "cell_type": "code", - "execution_count": 43, - "id": "ad764d69", + "execution_count": null, + "id": "e533a24a", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "%3\n", - "\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "flights details\n", - "Feature Store\n", - "751D665EB6AE7360928F15705F9F0F48\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "Flight details2\n", - "Entity\n", - "843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "751D665EB6AE7360928F15705F9F0F48->843E320A28F319748425787F04BCD3B8\n", - "\n", - "\n", - "\n", - "\n", - "4E21D2D878A101E8804837CAD6499FD9\n", - "\n", - "airlines_feature_group\n", - "Feature Group\n", - "4E21D2D878A101E8804837CAD6499FD9\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8->4E21D2D878A101E8804837CAD6499FD9\n", - "\n", - "\n", - "\n", - "\n", - "6881C3E17FC9BBB02934BB7B6B9068D1\n", - "\n", - "flights_dataset\n", - "Dataset\n", - "6881C3E17FC9BBB02934BB7B6B9068D1\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8->6881C3E17FC9BBB02934BB7B6B9068D1\n", - "\n", - "\n", - "\n", - "\n", - "C1771CFDA79A082BB9FB85D9E5FCB192\n", - "\n", - "airport_feature_group\n", - "Feature Group\n", - "C1771CFDA79A082BB9FB85D9E5FCB192\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8->C1771CFDA79A082BB9FB85D9E5FCB192\n", - "\n", - "\n", - "\n", - "\n", - "C24E858807F4EBA22BF14C08B9A6E2DD\n", - "\n", - "flights_feature_group\n", - "Feature Group\n", - "C24E858807F4EBA22BF14C08B9A6E2DD\n", - "\n", - "\n", - "843E320A28F319748425787F04BCD3B8->C24E858807F4EBA22BF14C08B9A6E2DD\n", - "\n", - "\n", - "\n", - "\n", - 
"4E21D2D878A101E8804837CAD6499FD9->6881C3E17FC9BBB02934BB7B6B9068D1\n", - "\n", - "\n", - "\n", - "\n", - "ocid1.modelcatalog.oc1.unique_ocid\n", - "\n", - " \n", - "Model\n", - "ocid1.modelcatalog.oc1.unique_ocid\n", - "\n", - "\n", - "6881C3E17FC9BBB02934BB7B6B9068D1->ocid1.modelcatalog.oc1.unique_ocid\n", - "\n", - "\n", - "\n", - "\n", - "C1771CFDA79A082BB9FB85D9E5FCB192->6881C3E17FC9BBB02934BB7B6B9068D1\n", - "\n", - "\n", - "\n", - "\n", - "C24E858807F4EBA22BF14C08B9A6E2DD->6881C3E17FC9BBB02934BB7B6B9068D1\n", - "\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "dataset.show()" ] }, { "cell_type": "code", - "execution_count": 44, - "id": "5b46e716", + "execution_count": null, + "id": "807e340c", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n", - "|format| id| name|description| location| createdAt| lastModified|partitionColumns|numFiles|sizeInBytes|properties|minReaderVersion|minWriterVersion|\n", - "+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n", - "| delta|7b4825ef-5a04-4fb...|843e320a28f319748...| null|oci://default-sto...|2023-07-14 04:31:...|2023-07-14 04:32:11| []| 2| 9038| {}| 1| 2|\n", - "+------+--------------------+--------------------+-----------+--------------------+--------------------+-------------------+----------------+--------+-----------+----------+----------------+----------------+\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "dataset.profile().show()" ] }, { "cell_type": 
"code", - "execution_count": 45, - "id": "13e18a51", + "execution_count": null, + "id": "df9155bf", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", - "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|IATA_CODE|AIRPORT|CITY|STATE|COUNTRY|LATITUDE|LONGITUDE|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", - "|2015| 1| 1| 4| B6| 1030| BQN| MCO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 262| SJU| BOS| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 2134| SJU| MCO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 730| BQN| MCO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 768| PSE| MCO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| B6| 2276| SJU| BDL| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| US| 602| ORD| PHX| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| AS| 695| GEG| SEA| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| HA| 102| HNL| ITO| null| null|null| null| null| null| null|\n", - "|2015| 1| 1| 4| OO| 5467| ONT| SFO| null| null|null| null| null| null| null|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+-------+----+-----+-------+--------+---------+\n", - "\n" - ] - } - ], + "outputs": [], "source": [ - "dataset.preview().show()" + "dataset.as_of(version_number=0).show()" ] }, { "cell_type": "markdown", - "id": "2f784a25", + "id": "db06133b", "metadata": { "pycharm": { "name": "#%% md\n" @@ -5038,8 +1296,8 @@ }, { "cell_type": "code", - "execution_count": 46, - "id": "79bdaf43", + 
"execution_count": null, + "id": "276e8053", "metadata": { "pycharm": { "name": "#%%\n" @@ -5057,62 +1315,32 @@ }, { "cell_type": "code", - "execution_count": 47, - "id": "8b02df32", + "execution_count": null, + "id": "d7987003", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+\n", - "|YEAR|MONTH|DAY|DAY_OF_WEEK|AIRLINE|FLIGHT_NUMBER|ORIGIN_AIRPORT|DESTINATION_AIRPORT|IATA_CODE|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+\n", - "|2015| 1| 1| 4| B6| 1030| BQN| MCO| BQN|\n", - "|2015| 1| 1| 4| B6| 262| SJU| BOS| SJU|\n", - "|2015| 1| 1| 4| B6| 2134| SJU| MCO| SJU|\n", - "|2015| 1| 1| 4| B6| 730| BQN| MCO| BQN|\n", - "|2015| 1| 1| 4| B6| 768| PSE| MCO| PSE|\n", - "|2015| 1| 1| 4| B6| 2276| SJU| BDL| SJU|\n", - "|2015| 1| 1| 4| US| 602| ORD| PHX| ORD|\n", - "|2015| 1| 1| 4| AS| 695| GEG| SEA| GEG|\n", - "|2015| 1| 1| 4| HA| 102| HNL| ITO| HNL|\n", - "|2015| 1| 1| 4| OO| 5467| ONT| SFO| ONT|\n", - "|2015| 1| 1| 4| HA| 108| HNL| KOA| HNL|\n", - "|2015| 1| 1| 4| AS| 730| ANC| SEA| ANC|\n", - "|2015| 1| 1| 4| HA| 206| HNL| OGG| HNL|\n", - "|2015| 1| 1| 4| UA| 1500| ORD| IAH| ORD|\n", - "|2015| 1| 1| 4| AA| 1323| MCO| MIA| MCO|\n", - "|2015| 1| 1| 4| NK| 103| BOS| MYR| BOS|\n", - "|2015| 1| 1| 4| OO| 7404| HIB| MSP| HIB|\n", - "|2015| 1| 1| 4| OO| 7419| ABR| MSP| ABR|\n", - "|2015| 1| 1| 4| OO| 5254| MAF| IAH| MAF|\n", - "|2015| 1| 1| 4| US| 480| SEA| PHX| SEA|\n", - "+----+-----+---+-----------+-------+-------------+--------------+-------------------+---------+\n", - "only showing top 20 rows\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "feature_store.sql(sql).show()" ] }, + { + "cell_type": "markdown", + "id": "10d6f553", + "metadata": {}, + "source": [ + 
"\n", + "### 3.9 Feature store Entities using YAML\n", + "In an ADS feature store module, you can either use the Python programmatic interface or YAML to define feature store entities. Below section describes how to create feature store entities using YAML as an interface." + ] + }, { "cell_type": "code", - "execution_count": 48, - "id": "6d72aefa", + "execution_count": null, + "id": "67f69307", "metadata": { "pycharm": { "name": "#%%\n" @@ -5125,9 +1353,9 @@ "kind: featureStore\n", "spec:\n", " displayName: Flights feature store\n", - " compartmentId: \"\"\n", + " compartmentId: \"ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\"\n", " offlineConfig:\n", - " metastoreId: \"\"\n", + " metastoreId: \"ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\"\n", "\n", " entity: &flights_entity\n", " - kind: entity\n", @@ -5217,266 +1445,14 @@ }, { "cell_type": "code", - "execution_count": 49, - "id": "23bc53a4", + "execution_count": null, + "id": "db2eb17e", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "fd2434312d73436fac996ff64f4f50f5", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "loop1: 0%| | 0/6 [00:00\n", "# References\n", "\n", + "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n", "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", @@ -5503,7 +1480,7 @@ { "cell_type": "code", "execution_count": null, - "id": "914eafdd", + "id": "4f95ea9b", "metadata": { "pycharm": { "name": "#%%\n" @@ -5515,9 +1492,9 @@ ], "metadata": { 
"kernelspec": { - "display_name": "Python [conda env:fspyspark32_p38_cpu#conda_v1]", + "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]", "language": "python", - "name": "conda-env-fspyspark32_p38_cpu_conda_v1-py" + "name": "conda-env-fspyspark32_p38_cpu_v1-py" }, "language_info": { "codemirror_mode": { diff --git a/notebook_examples/feature_store_quickstart.ipynb b/notebook_examples/feature_store_quickstart.ipynb index 403a2af2..e795ccb8 100644 --- a/notebook_examples/feature_store_quickstart.ipynb +++ b/notebook_examples/feature_store_quickstart.ipynb @@ -2,17 +2,13 @@ "cells": [ { "cell_type": "raw", - "id": "4a426ee8", - "metadata": { - "pycharm": { - "name": "#%% raw\n" - } - }, + "id": "5563bdd3", + "metadata": {}, "source": [ "@notebook{feature_store-quickstart.ipynb,\n", " title: Using feature store for feature ingestion and feature querying,\n", - " summary: Feature store quickstart guide to perform feature ingestion and feature querying.,\n", - " developed_on: fs_pyspark32_p38_cpu_v1,\n", + " summary: Introduction to the Oracle Cloud Infrastructure Feature Store.Use feature store for feature ingestion and feature querying.,\n", + " developed_on: fspyspark32_p38_cpu_v1,\n", " keywords: feature store,\n", " license: Universal Permissive License v 1.0\n", "}" @@ -20,37 +16,23 @@ }, { "cell_type": "code", - "execution_count": 1, - "id": "9e98a0a2", + "execution_count": null, + "id": "35bdd0d7", "metadata": { "pycharm": { - "is_executing": true, - "name": "#%%\n" + "is_executing": true } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "zsh:1: command not found: odsc\r\n" - ] - } - ], + "outputs": [], "source": [ - "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", - "\n", - "!odsc conda install --uri 
https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/service-conda-packs-fs/o/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda" + "# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.\n", + "!pip install --pre --no-deps oracle-ads==2.9.0rc0" ] }, { "cell_type": "markdown", - "id": "67dc5be9", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "725a5e59", + "metadata": {}, "source": [ "Oracle Data Science service sample notebook.\n", "\n", @@ -66,22 +48,28 @@ "---\n", "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps defines a standardised pipeline for ingestion of data and querying of data.\n", "\n", + "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n", + "\n", "## Contents:\n", "\n", "- 1. Introduction\n", "- 2. Pre-requisites\n", - " - 2.1 Policies\n", - " - 2.2 Authentication\n", - " - 2.3 Variables\n", + " - 2.1 Setup\n", + " - 2.2 Policies\n", + " - 2.3 Authentication\n", + " - 2.4 Variables\n", "- 3. Feature store quickstart using APIs\n", - " - 3.1. Create feature store\n", - " - 3.2. Create business entity in feature store\n", - " - 3.3. Create feature group and upload data to feature group\n", - " - 3.4. Query feature group\n", - " - 3.5. Create dataset from multiple or one feature group\n", - " - 3.6 Query dataset\n", - "- 4. Feature store quickstart using YAML\n", - "- 5. 
References\n", + " - 3.1 Exploration of data\n", + " - 3.2 Create feature store logical entities\n", + " - 3.2.1 Create feature store\n", + " - 3.2.2 Create business entity in feature store\n", + " - 3.2.3 Create transformation in feature store\n", + " - 3.2.4 Create feature group and upload data to feature group\n", + " - 3.3 Explore feature group\n", + " - 3.4 Create dataset from multiple or one feature group\n", + " - 3.5 Explore dataset\n", + " - 4. Feature store quickstart using YAML\n", + " - 5. References\n", "\n", "---\n", "\n", @@ -93,19 +81,15 @@ "\n", "Datasets are provided as a convenience. Datasets are considered third-party content and are not considered materials under your agreement with Oracle.\n", "\n", - "This [`Citi Bike`](https://ride.citibikenyc.com/data-sharing-policy) dataset license is used in this notebook.\n", + "The `Citibike` dataset is used in this notebook. You can access the Citibike dataset license [here](https://ride.citibikenyc.com/data-sharing-policy).\n", "\n", "---" ] }, { "cell_type": "markdown", - "id": "d41663f1", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "90024d60", + "metadata": {}, "source": [ "\n", "# 1. Introduction\n", @@ -132,32 +116,29 @@ }, { "cell_type": "markdown", - "id": "ce2f00ee", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "9fb00256", + "metadata": {}, "source": [ "\n", - "# 2. Pre-requisites\n", + "# 2. Pre-requisites to Running this Notebook \n", "\n", "Notebook Sessions are accessible through the following conda environment: \n", "\n", - "* **PySpark 3.2 and Feature store 1.0 (fs_pyspark32_p38_cpu_v1)**\n", + "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", "\n", - "You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. " + "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. 
" ] }, { "cell_type": "markdown", - "id": "f503e105", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "83904ad6", + "metadata": {}, "source": [ + "\n", + "### 2.1. Setup\n", + "\n", + "To set up the environment, a `spark-defaults.conf` must be configured. Data Catalog Metastore id must also be provided.\n", + "\n", "\n", "### `spark-defaults.conf`\n", "\n", @@ -180,25 +161,16 @@ "\n", "```bash\n", "odsc data-catalog config --help\n", - "```\n", - "\n", - "\n", - "### Session Setup\n", - "\n", - "The notebook makes connections to the Data Catalog metastore and Object Storage. In the next cell, specify the bucket URI to act as the data warehouse. Use the `warehouse_uri` variable with the `oci://@/` format. Update the variable `metastore_id` with the OCID of the Data Catalog metastore." + "```" ] }, { "cell_type": "markdown", - "id": "9a781306", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "6bdca361", + "metadata": {}, "source": [ "\n", - "### 2.1. Policies\n", + "### 2.2. Policies\n", "This section covers the creation of dynamic groups and policies needed to use the service.\n", "\n", "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)\n", @@ -207,57 +179,41 @@ }, { "cell_type": "markdown", - "id": "2c7106e4", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "cf094492", + "metadata": {}, "source": [ "\n", - "### 2.2. Authentication\n", + "### 2.3. Authentication\n", "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook Spark cluster.
\n", "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. " ] }, { "cell_type": "code", - "execution_count": 2, - "id": "89bdc3aa", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "9f35e1a0", + "metadata": {}, "outputs": [], "source": [ "import ads\n", - "ads.set_auth(auth=\"api_key\", client_kwargs={\"service_endpoint\": \"http://{api_gateway}:21000/20230101\"})" + "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"http://{api_gateway}/20230101\"})" ] }, { "cell_type": "markdown", - "id": "d7c223c0", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "17b184d7", + "metadata": {}, "source": [ "\n", - "### 2.3. Variables\n", - "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for storing logs. The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage and the metastore id of hive metastore is tied to feature store construct of feature store service." + "### 2.4. Variables\n", + "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and `` which is the OCID of the Data Catalog metastore. The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage and the metastore id of hive metastore is tied to feature store construct of feature store service." 
] }, { "cell_type": "code", - "execution_count": 3, - "id": "a2ca06cb", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "9b7f9ecc", + "metadata": {}, "outputs": [], "source": [ "import os\n", @@ -268,84 +224,96 @@ }, { "cell_type": "markdown", - "id": "03dc9e2c", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "931d2532", + "metadata": {}, "source": [ "\n", "# 3. Feature store quick start using APIs\n", - "By default the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) and [deeque](https://github.com/awslabs/deequ) libraries. In an ADS feature store module, you can either use the Python programmatic or YAML interface to define feature store entities. Below section describes how to create feature store entities using programmatic interface." + "By default, the **PySpark 3.2 and Feature Store Python 3.8** conda environment includes the pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) library. In an ADS feature store module, you can use either the Python programmatic interface or the YAML interface to define feature store entities. The section below describes how to create feature store entities using the programmatic interface." ] }, { "cell_type": "code", - "execution_count": 4, - "id": "3bfeace2", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:37: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:64: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n", - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/pandas/__init__.py:46: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " LooseVersion(pyarrow.__version__) >= LooseVersion(\"2.0.0\")\n", - "\n", - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/pandas/__init__.py:49: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.\n", - " warnings.warn(\n", - "\n", - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/pandas/groupby.py:49: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) >= LooseVersion(\"1.3.0\"):\n", - "\n", - "ERROR:logger:Please set env variable SPARK_VERSION\n", - "INFO:logger:Using deequ: com.amazon.deequ:deequ:1.2.2-spark-3.0\n" - ] - } - ], + "execution_count": null, + "id": "9c8018d4", + "metadata": {}, + "outputs": [], "source": [ "import pandas as pd \n", "from ads.feature_store.feature_store import FeatureStore\n", "from ads.feature_store.dataset import Dataset\n", "from ads.feature_store.feature_group import FeatureGroup\n", "from ads.feature_store.feature_store_registrar import FeatureStoreRegistrar\n", - "from ads.feature_store.common.enums import ExpectationType" + "from ads.feature_store.common.enums import ExpectationType\n", + "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", + "from ads.feature_store.transformation import TransformationMode" ] }, { "cell_type": "markdown", - "id": "2b3fad36", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "4e007b50", + "metadata": {}, "source": [ - "\n", - "### 3.1 Create feature store\n", - "Feature store is a top level construct to provide logical segregation of resources" + "\n", + "### 3.1 Exploration of data" ] }, { "cell_type": "code", - "execution_count": 5, - "id": "4688d55b", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "5882786a", + "metadata": {}, + "outputs": [], + "source": [ + "bike_df = pd.read_csv(\"data/201901-citibike-tripdata.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a5c9b752", + "metadata": {}, + "outputs": [], + "source": [ + "bike_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e09f121", + "metadata": {}, + "outputs": [], + "source": [ + "bike_df.columns = bike_df.columns.str.replace(' ', '')" + ] + }, + { + "cell_type": "markdown", + "id": "58a3b034", + "metadata": {}, + "source": [ + "\n", + "### 3.2. 
Create feature store logical entities" + ] + }, + { + "cell_type": "markdown", + "id": "0faeae33", + "metadata": {}, + "source": [ + "\n", + "#### 3.2.1 Feature Store\n", + "\n", + "Feature store is the top level entity for feature store service.\n", + "Call the ```.create()``` method of the Feature store instance to create a feature store." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3548e0a6", + "metadata": {}, "outputs": [], "source": [ "feature_store_resource = (\n", @@ -359,13 +327,9 @@ }, { "cell_type": "code", - "execution_count": 6, - "id": "191d1d31", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "e79d9727", + "metadata": {}, "outputs": [], "source": [ "feature_store = feature_store_resource.create()" @@ -373,27 +337,19 @@ }, { "cell_type": "markdown", - "id": "0ba52241", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "aca2d27c", + "metadata": {}, "source": [ "\n", - "### 3.2 Create entity\n", + "#### 3.2.2 Entity\n", "An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc." 
] }, { "cell_type": "code", - "execution_count": 7, - "id": "f3fff48b", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "28e9762c", + "metadata": {}, "outputs": [], "source": [ "entity = feature_store.create_entity(\n", @@ -404,241 +360,70 @@ }, { "cell_type": "markdown", - "id": "8f1d165b", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "6a1ec785", + "metadata": {}, "source": [ - "\n", - "### 3.3 Create feature group\n", - "A feature group is the code that contains instructions on the ingestion of raw data and computation of the feature. This [`Citi Bike`](https://ride.citibikenyc.com/data-sharing-policy) dataset license is used in this notebook. values. " + "\n", + "#### 3.2.3 Transformation\n", + "Transformations in a feature store refer to the operations and processes applied to raw data to create, modify, or derive new features that can be used as inputs for ML models." ] }, { "cell_type": "code", - "execution_count": 8, - "id": "6aaac72f", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "dc898997", + "metadata": {}, "outputs": [], "source": [ - "bike_df = pd.read_csv(\"/data/flights-data/archives/201901-citibike-tripdata.csv\")" + "def is_round_trip(bike_df):\n", + " bike_df['roundtrip'] = bike_df['startstationid'] == bike_df['endstationid']\n", + " return bike_df" ] }, { "cell_type": "code", - "execution_count": 9, - "id": "47140320", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "d624680b", + "metadata": {}, "outputs": [], "source": [ - "bike_df = bike_df.drop(['start station name', 'end station name'], axis=1)\n", - "bike_df.columns = bike_df.columns.str.replace(' ', '')" + "transformation = feature_store.create_transformation(\n", + " transformation_mode=TransformationMode.PANDAS,\n", + " source_code_func=is_round_trip,\n", + " display_name=\"is_round_trip\",\n", + ")\n", + "transformation" ] }, { - 
"cell_type": "code", - "execution_count": 10, - "id": "e87a1587", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " tripduration starttime stoptime \\\n", - "0 320 2019-01-01 00:01:47.4010 2019-01-01 00:07:07.5810 \n", - "1 316 2019-01-01 00:04:43.7360 2019-01-01 00:10:00.6080 \n", - "2 591 2019-01-01 00:06:03.9970 2019-01-01 00:15:55.4380 \n", - "3 2719 2019-01-01 00:07:03.5450 2019-01-01 00:52:22.6500 \n", - "4 303 2019-01-01 00:07:35.9450 2019-01-01 00:12:39.5020 \n", - "\n", - " startstationid startstationlatitude startstationlongitude endstationid \\\n", - "0 3160.0 40.778968 -73.973747 3283.0 \n", - "1 519.0 40.751873 -73.977706 518.0 \n", - "2 3171.0 40.785247 -73.976673 3154.0 \n", - "3 504.0 40.732219 -73.981656 3709.0 \n", - "4 229.0 40.727434 -73.993790 503.0 \n", - "\n", - " endstationlatitude endstationlongitude bikeid usertype birthyear \\\n", - "0 40.788221 -73.970416 15839 Subscriber 1971 \n", - "1 40.747804 -73.973442 32723 Subscriber 1964 \n", - "2 40.773142 -73.958562 27451 Subscriber 1987 \n", - "3 40.738046 -73.996430 21579 Subscriber 1990 \n", - "4 40.738274 -73.987520 35379 Subscriber 1979 \n", - "\n", - " gender \n", - "0 1 \n", - "1 1 \n", - "2 1 \n", - "3 1 \n", - "4 1 " - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], + "cell_type": "markdown", + "id": "550abcbb", + "metadata": {}, "source": [ - "bike_df.head()" + "\n", + "#### 3.2.4 Feature group\n", + "A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource. 
" ] }, { - "cell_type": "code", - "execution_count": 11, - "id": "e704bb08", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{\"expectation_type\": \"expect_column_values_to_not_be_null\", \"meta\": {}, \"kwargs\": {\"column\": \"stoptime\"}}" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], + "cell_type": "markdown", + "id": "22c00f3a", + "metadata": {}, "source": [ - "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", + "\n", + "##### 3.2.4.1 Associate Expectation Suite\n", + "Feature validation is the process of checking the quality and accuracy of the features used in a machine learning model.Feature store allows you to define expectation on the data which is being materialized into feature group and dataset.This is achieved using open source library Great Expectations.\n", "\n", + "An Expectation is a verifiable assertion about your data. You can define expectation as below:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d5a352f", + "metadata": {}, + "outputs": [], + "source": [ "expectation_suite = ExpectationSuite(expectation_suite_name=\"feature_definition\")\n", "expectation_suite.add_expectation(\n", " ExpectationConfiguration(\n", @@ -650,13 +435,9 @@ }, { "cell_type": "code", - "execution_count": 12, - "id": "02d6fc25", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "2f9cc4e8", + "metadata": {}, "outputs": [], "source": [ "feature_group_bike = (\n", @@ -668,722 +449,187 @@ " .with_compartment_id(compartment_id)\n", " .with_schema_details_from_dataframe(bike_df)\n", " .with_expectation_suite(expectation_suite, ExpectationType.LENIENT)\n", + " .with_transformation_id(transformation.id)\n", ")" ] }, { "cell_type": "code", - "execution_count": 13, - "id": "228401d1", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - 
"data": { - "text/plain": [ - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", - " entityId: 1C29D0DF65E456211B7351D85F271E03\n", - " expectationDetails:\n", - " createRuleDetails:\n", - " - arguments:\n", - " column: stoptime\n", - " levelType: ERROR\n", - " name: Rule-0\n", - " ruleType: expect_column_values_to_not_be_null\n", - " expectationType: LENIENT\n", - " name: feature_definition\n", - " validationEngineType: GREAT_EXPECTATIONS\n", - " featureStoreId: AB5F8E0C4BD86255C3828039D8C51853\n", - " id: 60E6662F04168EEFE781D7ACE576F339\n", - " inputFeatureDetails:\n", - " - featureType: INTEGER\n", - " name: tripduration\n", - " orderNumber: 1\n", - " - featureType: STRING\n", - " name: starttime\n", - " orderNumber: 2\n", - " - featureType: STRING\n", - " name: stoptime\n", - " orderNumber: 3\n", - " - featureType: FLOAT\n", - " name: startstationid\n", - " orderNumber: 4\n", - " - featureType: FLOAT\n", - " name: startstationlatitude\n", - " orderNumber: 5\n", - " - featureType: FLOAT\n", - " name: startstationlongitude\n", - " orderNumber: 6\n", - " - featureType: FLOAT\n", - " name: endstationid\n", - " orderNumber: 7\n", - " - featureType: FLOAT\n", - " name: endstationlatitude\n", - " orderNumber: 8\n", - " - featureType: FLOAT\n", - " name: endstationlongitude\n", - " orderNumber: 9\n", - " - featureType: INTEGER\n", - " name: bikeid\n", - " orderNumber: 10\n", - " - featureType: STRING\n", - " name: usertype\n", - " orderNumber: 11\n", - " - featureType: INTEGER\n", - " name: birthyear\n", - " orderNumber: 12\n", - " - featureType: INTEGER\n", - " name: gender\n", - " orderNumber: 13\n", - " name: bike_feature_group\n", - " primaryKeys:\n", - " items:\n", - " - name: bikeid\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: featureGroup" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": 
null, + "id": "67b8b4ef", + "metadata": {}, + "outputs": [], "source": [ "feature_group_bike.create()" ] }, + { + "cell_type": "markdown", + "id": "fbe3f5bf", + "metadata": {}, + "source": [ + "\n", + "To persist the feature group and save the feature data along with its metadata in the feature store, call the `materialise()` method with a DataFrame." + ] + }, { "cell_type": "code", - "execution_count": 14, - "id": "98afef8e", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "f63e15f1", + "metadata": {}, "outputs": [], "source": [ - "os.environ[\"DEVELOPER_MODE\"] = \"True\"" + "feature_group_bike.materialise(bike_df)" + ] + }, + { + "cell_type": "markdown", + "id": "d9ac48a1", + "metadata": {}, + "source": [ + "\n", + "### 3.3. Explore feature groups" + ] + }, + { + "cell_type": "markdown", + "id": "0377adfa", + "metadata": {}, + "source": [ + "You can retrieve feature data in a DataFrame that can either be used directly to train models or be materialized to file(s) for later use in training." 
] }, { "cell_type": "code", - "execution_count": 15, - "id": "732e20e8", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - ":: loading settings :: url = jar:file:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Ivy Default Cache set to: /Users/kshitizlohia/.ivy2/cache\n", - "The jars for the packages stored in: /Users/kshitizlohia/.ivy2/jars\n", - "io.delta#delta-core_2.12 added as a dependency\n", - ":: resolving dependencies :: org.apache.spark#spark-submit-parent-e96bd2ce-ad22-46d2-bd46-aa51029113aa;1.0\n", - "\tconfs: [default]\n", - "\tfound io.delta#delta-core_2.12;2.3.0 in central\n", - "\tfound io.delta#delta-storage;2.3.0 in central\n", - 
"\tfound org.antlr#antlr4-runtime;4.8 in local-m2-cache\n", - ":: resolution report :: resolve 137ms :: artifacts dl 25ms\n", - "\t:: modules in use:\n", - "\tio.delta#delta-core_2.12;2.3.0 from central in [default]\n", - "\tio.delta#delta-storage;2.3.0 from central in [default]\n", - "\torg.antlr#antlr4-runtime;4.8 from local-m2-cache in [default]\n", - "\t---------------------------------------------------------------------\n", - "\t| | modules || artifacts |\n", - "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", - "\t---------------------------------------------------------------------\n", - "\t| default | 3 | 0 | 0 | 0 || 3 | 0 |\n", - "\t---------------------------------------------------------------------\n", - ":: retrieving :: org.apache.spark#spark-submit-parent-e96bd2ce-ad22-46d2-bd46-aa51029113aa\n", - "\tconfs: [default]\n", - "\t0 artifacts copied, 3 already retrieved (0kB/8ms)\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "23/05/16 18:29:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting default log level to \"WARN\".\n", - "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:37: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/conversion.py:474: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n", - " for column, series in pdf.iteritems():\n", - "\n", - "WARNING:py.warnings:/Users/kshitizlohia/IdeaProjects/oracle/feature-store/advanced-ds/venv/lib/python3.10/site-packages/pyspark/sql/pandas/conversion.py:486: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n", - " for column, series in pdf.iteritems():\n", - "\n", - "INFO:great_expectations.validator.validator:\t1 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "64ddfd3353dd457c99630a61d89fe748", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/6 [00:00 (0 + 8) / 8]\r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "23/05/16 18:30:05 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory\n", - "Scaling row group sizes to 96.54% for 7 writers\n", - "23/05/16 18:30:05 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory\n", - "Scaling row group sizes to 84.47% for 8 writers\n", - "23/05/16 18:30:07 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory\n", - "Scaling row group sizes to 96.54% for 7 writers\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "\r", - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "23/05/16 18:30:11 WARN package: Truncated the string representation of a plan since it was too large. 
This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "23/05/16 18:30:15 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `1c29d0df65e456211b7351d85f271e03`.`bike_feature_group` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "execution_count": null, + "id": "54116cfa", + "metadata": {}, + "outputs": [], "source": [ - "feature_group_bike.materialise(bike_df)" + "query = feature_group_bike.select() \n", + "query.show()" + ] + }, + { + "cell_type": "markdown", + "id": "9f022e11", + "metadata": {}, + "source": [ + "You can call the `get_statistics()` method of the feature group to fetch statistics for a specific ingestion job.You can use `to_pandas()` or `to_json()` to view the statistics." ] }, { "cell_type": "code", - "execution_count": 16, - "id": "711efb2e", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
endstationlongitudetripdurationbikeidstartstationlongitudeendstationidusertypestarttimestartstationidendstationlatitudestartstationlatitudebirthyearstoptimegender
completeness1.01.01.01.01.01.01.01.01.01.01.01.01.0
approximateNumDistinctValues83929483932104858986361013
dataTypeFractionalIntegralIntegralFractionalFractionalStringStringFractionalFractionalFractionalIntegralStringIntegral
sum-7398.15000476840.02914421.0-7398.157728155797.0NaNNaN186276.04074.015994074.092498198127.0NaN118.0
min-74.01658497.014656.0-74.012723127.0NaNNaN79.040.66860340.6681271949.0NaN0.0
max-73.9419953494.035789.0-73.9422373709.0NaNNaN3675.040.81079240.8042131999.0NaN2.0
mean-73.9815768.429144.21-73.9815771557.97NaNNaN1862.7640.7401640.7409251981.27NaN1.18
stddev0.018151686.1878466319.2343260.0174651428.093551NaNNaN1438.055320.0318280.0325911.713117NaN0.497594
\n", - "
" - ], - "text/plain": [ - " endstationlongitude tripduration bikeid \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 83 92 94 \n", - "dataType Fractional Integral Integral \n", - "sum -7398.150004 76840.0 2914421.0 \n", - "min -74.016584 97.0 14656.0 \n", - "max -73.941995 3494.0 35789.0 \n", - "mean -73.9815 768.4 29144.21 \n", - "stddev 0.018151 686.187846 6319.234326 \n", - "\n", - " startstationlongitude endstationid usertype \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 83 93 2 \n", - "dataType Fractional Fractional String \n", - "sum -7398.157728 155797.0 NaN \n", - "min -74.012723 127.0 NaN \n", - "max -73.942237 3709.0 NaN \n", - "mean -73.981577 1557.97 NaN \n", - "stddev 0.017465 1428.093551 NaN \n", - "\n", - " starttime startstationid endstationlatitude \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 104 85 89 \n", - "dataType String Fractional Fractional \n", - "sum NaN 186276.0 4074.01599 \n", - "min NaN 79.0 40.668603 \n", - "max NaN 3675.0 40.810792 \n", - "mean NaN 1862.76 40.74016 \n", - "stddev NaN 1438.05532 0.031828 \n", - "\n", - " startstationlatitude birthyear stoptime \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 86 36 101 \n", - "dataType Fractional Integral String \n", - "sum 4074.092498 198127.0 NaN \n", - "min 40.668127 1949.0 NaN \n", - "max 40.804213 1999.0 NaN \n", - "mean 40.740925 1981.27 NaN \n", - "stddev 0.03259 11.713117 NaN \n", - "\n", - " gender \n", - "completeness 1.0 \n", - "approximateNumDistinctValues 3 \n", - "dataType Integral \n", - "sum 118.0 \n", - "min 0.0 \n", - "max 2.0 \n", - "mean 1.18 \n", - "stddev 0.497594 " - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "id": "00b66cbe", + "metadata": {}, + "outputs": [], "source": [ "feature_group_bike.get_statistics().to_pandas()" ] }, + { + "cell_type": "markdown", + "id": "8adf24e2", + 
"metadata": {}, + "source": [ + "You can visualize feature statistics with `to_viz()`" + ] + }, { "cell_type": "code", - "execution_count": 25, - "id": "5bfcded2", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
successresultsstatistics.evaluated_expectationsstatistics.successful_expectationsstatistics.unsuccessful_expectationsstatistics.success_percentmeta.great_expectations_versionmeta.expectation_suite_namemeta.run_id.run_timemeta.run_id.run_namemeta.batch_markers.ge_load_timemeta.active_batch_definition.datasource_namemeta.active_batch_definition.data_connector_namemeta.active_batch_definition.data_asset_namemeta.active_batch_definition.batch_identifiers.ge_batch_idmeta.validation_timemeta.checkpoint_name
0True[{'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'meta': {}, 'kwargs': {'column': 'stoptime', 'batch_id': 'feca776acdd0aa61ae53da7b674430a1'}}, 'exception_info': {'raised_exception': False, 'exception_traceback': None, 'exception_message': None}, 'result': {'element_count': 100, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}, 'success': True, 'meta': {}}]110100.00.16.10bike_feature_group2023-05-16T18:29:58.670292+05:30None20230516T125958.669418Zfeature-ingestion-pipelinefeature-ingestion-pipelinefeature-ingestion-pipeline8ff83c32-f3e9-11ed-aedd-b29c4acce13020230516T125958.670193ZNone
\n", - "
" - ], - "text/plain": [ - " success \\\n", - "0 True \n", - "\n", - " results \\\n", - "0 [{'expectation_config': {'expectation_type': 'expect_column_values_to_not_be_null', 'meta': {}, 'kwargs': {'column': 'stoptime', 'batch_id': 'feca776acdd0aa61ae53da7b674430a1'}}, 'exception_info': {'raised_exception': False, 'exception_traceback': None, 'exception_message': None}, 'result': {'element_count': 100, 'unexpected_count': 0, 'unexpected_percent': 0.0, 'partial_unexpected_list': []}, 'success': True, 'meta': {}}] \n", - "\n", - " statistics.evaluated_expectations statistics.successful_expectations \\\n", - "0 1 1 \n", - "\n", - " statistics.unsuccessful_expectations statistics.success_percent \\\n", - "0 0 100.0 \n", - "\n", - " meta.great_expectations_version meta.expectation_suite_name \\\n", - "0 0.16.10 bike_feature_group \n", - "\n", - " meta.run_id.run_time meta.run_id.run_name \\\n", - "0 2023-05-16T18:29:58.670292+05:30 None \n", - "\n", - " meta.batch_markers.ge_load_time \\\n", - "0 20230516T125958.669418Z \n", - "\n", - " meta.active_batch_definition.datasource_name \\\n", - "0 feature-ingestion-pipeline \n", - "\n", - " meta.active_batch_definition.data_connector_name \\\n", - "0 feature-ingestion-pipeline \n", - "\n", - " meta.active_batch_definition.data_asset_name \\\n", - "0 feature-ingestion-pipeline \n", - "\n", - " meta.active_batch_definition.batch_identifiers.ge_batch_id \\\n", - "0 8ff83c32-f3e9-11ed-aedd-b29c4acce130 \n", - "\n", - " meta.validation_time meta.checkpoint_name \n", - "0 20230516T125958.670193Z None " - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "id": "09afd99d", + "metadata": {}, + "outputs": [], "source": [ - "feature_group_bike.get_validation_output_df()" + "feature_group_bike.get_statistics().to_viz()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a9a05fa", + "metadata": {}, + "outputs": [], + "source": [ + 
"feature_group_bike.get_statistics().to_viz([\"birthyear\"])" ] }, { "cell_type": "markdown", - "id": "b7ba161c", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "088f602c", + "metadata": {}, "source": [ - "\n", - "### 3.4 Query feature group\n", - "Feature store provides a DataFrame API to ingest data into the Feature Store. You can also retrieve feature data in a DataFrame, that can either be used directly to train models or materialized to file(s) for later use to train models" + "You can call the `get_validation_output()` method of the FeatureGroup instance to fetch validation results for a specific ingestion job." ] }, { "cell_type": "code", - "execution_count": 17, - "id": "c175849c", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+------------+--------------------+--------------------+--------------+--------------------+---------------------+------------+------------------+-------------------+------+----------+---------+------+\n", - "|tripduration| starttime| stoptime|startstationid|startstationlatitude|startstationlongitude|endstationid|endstationlatitude|endstationlongitude|bikeid| usertype|birthyear|gender|\n", - "+------------+--------------------+--------------------+--------------+--------------------+---------------------+------------+------------------+-------------------+------+----------+---------+------+\n", - "| 976|2019-01-01 00:15:...|2019-01-01 00:31:...| 3452.0| 40.71915571696044| -73.94885390996933| 251.0| 40.72317958| -73.99480012| 35685|Subscriber| 1994| 1|\n", - "| 97|2019-01-01 00:15:...|2019-01-01 00:17:...| 3430.0| 40.71907891179564| -73.94223690032959| 3095.0| 40.71929301| -73.94500379| 34307|Subscriber| 1988| 1|\n", - "| 467|2019-01-01 00:16:...|2019-01-01 00:24:...| 507.0| 40.73912601| -73.97973776| 492.0| 40.75019995| -73.99093085| 35561|Subscriber| 1989| 1|\n", - "| 348|2019-01-01 00:17:...|2019-01-01 00:23:...| 
3095.0| 40.71929301| -73.94500379| 3101.0| 40.72079821| -73.95484712| 35695|Subscriber| 1988| 1|\n", - "| 505|2019-01-01 00:18:...|2019-01-01 00:27:...| 3132.0| 40.76350532| -73.97109243| 359.0| 40.75510267| -73.97498696| 31801|Subscriber| 1981| 1|\n", - "| 3494|2019-01-01 00:18:...|2019-01-01 01:17:...| 3171.0| 40.78524672| -73.97667321| 3164.0| 40.7770575| -73.97898475| 35785|Subscriber| 1954| 1|\n", - "| 829|2019-01-01 00:19:...|2019-01-01 00:32:...| 3165.0| 40.77579376683666| -73.9762057363987| 3295.0| 40.79127| -73.964839| 32106|Subscriber| 1969| 0|\n", - "| 451|2019-01-01 00:21:...|2019-01-01 00:28:...| 403.0| 40.72502876| -73.99069656| 545.0| 40.736502| -73.97809472| 32038|Subscriber| 1985| 1|\n", - "| 736|2019-01-01 00:21:...|2019-01-01 00:33:...| 3165.0| 40.77579376683666| -73.9762057363987| 3295.0| 40.79127| -73.964839| 16761| Customer| 1989| 2|\n", - "| 617|2019-01-01 00:21:...|2019-01-01 00:31:...| 3159.0| 40.77492513| -73.98266566| 3142.0| 40.7612274| -73.96094022| 24895|Subscriber| 1998| 1|\n", - "+------------+--------------------+--------------------+--------------+--------------------+---------------------+------------+------------------+-------------------+------+----------+---------+------+\n", - "only showing top 10 rows\n", - "\n" - ] - } - ], + "execution_count": null, + "id": "8dd4687f", + "metadata": {}, + "outputs": [], "source": [ - "query = feature_group_bike.select() \n", - "query.show()" + "feature_group_bike.get_validation_output().to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce9db608", + "metadata": {}, + "outputs": [], + "source": [ + "feature_group_bike.get_validation_output().to_summary()" ] }, { "cell_type": "markdown", - "id": "962e563d", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "e468f448", + "metadata": {}, + "source": [ + "\n", + "#### Visualise lineage\n", + "\n", + "Use the ```.show()``` method on the FeatureGroup instance to visualize the lineage of the 
featuregroup." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e147e248", + "metadata": {}, + "outputs": [], + "source": [ + "feature_group_bike.show()" + ] + }, + { + "cell_type": "markdown", + "id": "e635249e", + "metadata": {}, "source": [ "\n", - "### 3.5 Create dataset\n", + "### 3.4 Create dataset\n", "A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference." ] }, { "cell_type": "code", - "execution_count": 18, - "id": "147ae5bd", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "text/plain": [ - "'SELECT fg_0.tripduration tripduration, fg_0.starttime starttime, fg_0.stoptime stoptime, fg_0.startstationid startstationid, fg_0.startstationlatitude startstationlatitude, fg_0.startstationlongitude startstationlongitude, fg_0.endstationid endstationid, fg_0.endstationlatitude endstationlatitude, fg_0.endstationlongitude endstationlongitude, fg_0.bikeid bikeid, fg_0.usertype usertype, fg_0.birthyear birthyear, fg_0.gender gender FROM `1C29D0DF65E456211B7351D85F271E03`.bike_feature_group fg_0'" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "id": "bc169f01", + "metadata": {}, + "outputs": [], "source": [ "query.to_string()" ] }, { "cell_type": "code", - "execution_count": 19, - "id": "440b129e", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "52f9a271", + "metadata": {}, "outputs": [], "source": [ "dataset_resource = (\n", @@ -1399,302 +645,106 @@ }, { "cell_type": "code", - "execution_count": 20, - "id": "10dd5758", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "d8661c89", + "metadata": {}, "outputs": [], "source": [ "dataset = dataset_resource.create()" ] }, + { + "cell_type": "markdown", + "id": "baaf2112", + "metadata": {}, + "source": [ + "You can call the 
`materialise()` method of the Dataset instance to load the data to dataset." + ] + }, { "cell_type": "code", - "execution_count": 21, - "id": "d4b077da", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "23/05/16 18:31:37 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `1c29d0df65e456211b7351d85f271e03`.`bike_riders_dataset` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - } - ], + "execution_count": null, + "id": "7228ed61", + "metadata": {}, + "outputs": [], "source": [ "dataset.materialise()" ] }, + { + "cell_type": "markdown", + "id": "b1b09af2", + "metadata": {}, + "source": [ + "\n", + "### 3.5 Explore dataset" + ] + }, { "cell_type": "code", - "execution_count": 22, - "id": "db5d6854", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
endstationlongitudetripdurationbikeidstartstationlongitudeendstationidusertypestarttimestartstationidendstationlatitudestartstationlatitudebirthyearstoptimegender
completeness1.01.01.01.01.01.01.01.01.01.01.01.01.0
approximateNumDistinctValues83929483932104858986361013
dataTypeFractionalIntegralIntegralFractionalFractionalStringStringFractionalFractionalFractionalIntegralStringIntegral
sum-7398.15000476840.02914421.0-7398.157728155797.0NaNNaN186276.04074.015994074.092498198127.0NaN118.0
min-74.01658497.014656.0-74.012723127.0NaNNaN79.040.66860340.6681271949.0NaN0.0
max-73.9419953494.035789.0-73.9422373709.0NaNNaN3675.040.81079240.8042131999.0NaN2.0
mean-73.9815768.429144.21-73.9815771557.97NaNNaN1862.7640.7401640.7409251981.27NaN1.18
stddev0.018151686.1878466319.2343260.0174651428.093551NaNNaN1438.055320.0318280.0325911.713117NaN0.497594
\n", - "
" - ], - "text/plain": [ - " endstationlongitude tripduration bikeid \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 83 92 94 \n", - "dataType Fractional Integral Integral \n", - "sum -7398.150004 76840.0 2914421.0 \n", - "min -74.016584 97.0 14656.0 \n", - "max -73.941995 3494.0 35789.0 \n", - "mean -73.9815 768.4 29144.21 \n", - "stddev 0.018151 686.187846 6319.234326 \n", - "\n", - " startstationlongitude endstationid usertype \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 83 93 2 \n", - "dataType Fractional Fractional String \n", - "sum -7398.157728 155797.0 NaN \n", - "min -74.012723 127.0 NaN \n", - "max -73.942237 3709.0 NaN \n", - "mean -73.981577 1557.97 NaN \n", - "stddev 0.017465 1428.093551 NaN \n", - "\n", - " starttime startstationid endstationlatitude \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 104 85 89 \n", - "dataType String Fractional Fractional \n", - "sum NaN 186276.0 4074.01599 \n", - "min NaN 79.0 40.668603 \n", - "max NaN 3675.0 40.810792 \n", - "mean NaN 1862.76 40.74016 \n", - "stddev NaN 1438.05532 0.031828 \n", - "\n", - " startstationlatitude birthyear stoptime \\\n", - "completeness 1.0 1.0 1.0 \n", - "approximateNumDistinctValues 86 36 101 \n", - "dataType Fractional Integral String \n", - "sum 4074.092498 198127.0 NaN \n", - "min 40.668127 1949.0 NaN \n", - "max 40.804213 1999.0 NaN \n", - "mean 40.740925 1981.27 NaN \n", - "stddev 0.03259 11.713117 NaN \n", - "\n", - " gender \n", - "completeness 1.0 \n", - "approximateNumDistinctValues 3 \n", - "dataType Integral \n", - "sum 118.0 \n", - "min 0.0 \n", - "max 2.0 \n", - "mean 1.18 \n", - "stddev 0.497594 " - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "id": "028c72dc", + "metadata": {}, + "outputs": [], + "source": [ + "dataset.as_of(version_number=0).show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"d5e2c54d", + "metadata": {}, + "outputs": [], "source": [ "dataset.get_statistics().to_pandas()" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd6e28d2", + "metadata": {}, + "outputs": [], + "source": [ + "dataset.get_statistics().to_viz()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fd4ed61", + "metadata": {}, + "outputs": [], + "source": [ + "dataset.profile().show()" + ] + }, { "cell_type": "markdown", - "id": "38da2a60", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, + "id": "76558b69", + "metadata": {}, + "source": [ + "\n", + "#### Visualise lineage\n", + "\n", + "Use the ```.show()``` method on the Dataset instance to visualize the lineage of the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8031042b", + "metadata": {}, + "outputs": [], + "source": [ + "dataset.show()" + ] + }, + { + "cell_type": "markdown", + "id": "e9aab9aa", + "metadata": {}, "source": [ "\n", "# 4. Feature store quick start using YAML\n", @@ -1703,13 +753,9 @@ }, { "cell_type": "code", - "execution_count": 23, - "id": "d3aa939e", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, + "execution_count": null, + "id": "1cf18dd5", + "metadata": {}, "outputs": [], "source": [ "feature_store_yaml = \"\"\"\n", @@ -1717,9 +763,9 @@ "kind: featureStore\n", "spec:\n", " displayName: Bike feature store\n", - " compartmentId: \"ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\"\n", + " compartmentId: \n", " offlineConfig:\n", - " metastoreId: \"ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\"\n", + " metastoreId: \n", "\n", " entity: &bike_entity\n", " - kind: entity\n", @@ -1759,139 +805,10 @@ }, { "cell_type": "code", - "execution_count": 24, - "id": "238a8507", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": 
"75021638e00044e09f9dfa4e15aa6ce9", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "loop1: 0%| | 0/4 [00:00\n", "# References\n", "\n", + "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n", "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", "- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb23af05", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]", "language": "python", - "name": "python3" + "name": "conda-env-fspyspark32_p38_cpu_v1-py" }, "language_info": { "codemirror_mode": { @@ -1932,7 +854,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.10" + "version": "3.8.17" } }, "nbformat": 4, diff --git a/notebook_examples/feature_store_schema_evolution.ipynb b/notebook_examples/feature_store_schema_evolution.ipynb index 940bca73..a430b97f 100644 --- a/notebook_examples/feature_store_schema_evolution.ipynb +++ b/notebook_examples/feature_store_schema_evolution.ipynb @@ -2,53 +2,37 @@ "cells": [ { "cell_type": "raw", - "id": "01cacd8a", + "id": "6e72604a", "metadata": {}, "source": [ "qweews@notebook{feature_store-querying.ipynb,\n", - " title: Using feature store for feature querying using pandas like interface for query and join,\n", - " summary: Feature store quickstart guide to perform feature querying using pandas like interface for query and join.,\n", - " developed_on: pyspark32_p38_cpu_feature_store_v1,\n", + " 
title: Schema Enforcement and Schema Evolution in Feature Store ,\n", + " summary: Perform Schema Enforcement and Schema Evolution in Feature Store when materialising the data.,\n", + " developed_on: fspyspark32_p38_cpu_v1,\n", " keywords: feature store, querying,\n", " license: Universal Permissive License v 1.0\n", "}" ] }, { - "cell_type": "raw", - "id": "dba1f334", + "cell_type": "code", + "execution_count": null, + "id": "997bb810", "metadata": { "ExecuteTime": { "end_time": "2023-05-24T08:26:08.572567Z", "start_time": "2023-05-24T08:26:08.328013Z" } }, + "outputs": [], "source": [ - "!odsc conda install --uri https://objectstorage.us-ashburn-1.oraclecloud.com/n/bigdatadatasciencelarge/b/service-conda-packs-fs/o/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "572d752e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "/bin/bash: -c: line 0: syntax error near unexpected token `newline'\n", - "/bin/bash: -c: line 0: `odsc data-catalog config --authentication resource_principal --metastore '\n" - ] - } - ], - "source": [ - "!odsc data-catalog config --authentication resource_principal --metastore " + "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", + "!pip install --pre --no-deps oracle-ads==2.9.0rc0" ] }, { "cell_type": "markdown", - "id": "ebe05d00", + "id": "3dd0bbd5", "metadata": {}, "source": [ "Oracle Data Science service sample notebook.\n", @@ -63,7 +47,7 @@ "---\n", "# Overview:\n", "---\n", - "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. 
Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps defines a standardised pipeline for ingestion of data and querying of data. This notebook demonstrates how to use feature store within a long lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster.\n", + "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps defines a standardised pipeline for ingestion of data and querying of data. This notebook shows how schema enforcement and schema evolution are carried out in Feature Store\n", "\n", "Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n", "\n", @@ -73,20 +57,23 @@ "\n", "## Contents:\n", "\n", - "- 1. Introduction\n", - "- 1. Pre-requisites\n", - " - 2.1 Policies\n", - " - 2.2 Authentication\n", - " - 2.3 Variables\n", + "- 1. Introduction\n", + "- 2. Pre-requisites\n", + " - 2.1 Setup\n", + " - 2.2 Policies\n", + " - 2.3 Authentication\n", + " - 2.4 Variables\n", "- 3. Schema enforcement and schema evolution\n", - " - 3.1. Exploration of data in feature store\n", - " - 3.2. Create feature store logical entities\n", + " - 3.1. Exploration of data in feature store\n", + " - 3.2. Create feature store logical entities\n", " - 3.3. Schema enforcement\n", " - 3.4. Ingestion Modes\n", " - 3.4.1 Append\n", " - 3.4.2 Overwrite\n", " - 3.4.3 Upsert\n", - "- 4. References\n", + " - 3.5. History\n", + " - 3.6. 
As_of Feature \n", + "- 4. References\n", "\n", "---\n", "\n", @@ -99,10 +86,10 @@ }, { "cell_type": "markdown", - "id": "6854cd38", + "id": "cc61a6ad", "metadata": {}, "source": [ - "\n", + "\n", "# 1. Introduction\n", "\n", "Oracle feature store is a stack based solution that is deployed in the customer enclave using OCI resource manager. Customer can stand up the service with infrastructure in their own tenancy. The service consists of API which are deployed in customer tenancy using resource manager.\n", @@ -127,24 +114,26 @@ }, { "cell_type": "markdown", - "id": "ae2fdf26", + "id": "10ada53a", "metadata": {}, "source": [ - "\n", - "# 2. Pre-requisites\n", - "\n", - "Data Flow Sessions are accessible through the following conda environment:\n", + "\n", + "# 2. Pre-requisites to Running this Notebook\n", + "Notebook Sessions are accessible through the following conda environment: \n", "\n", - "* **PySpark 3.2, Feature store 1.0 and Data Flow 1.0 (fs_pyspark32_p38_cpu_v1)**\n", + "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", "\n", - "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore id of hive metastore is tied to feature store construct of feature store service.\n" + "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. \n" ] }, { "cell_type": "markdown", - "id": "486e5d3f", + "id": "e519c49e", "metadata": {}, "source": [ + "\n", + "### 2.1. 
Setup\n", + "\n", "\n", "### `spark-defaults.conf`\n", "\n", @@ -167,21 +156,16 @@ "\n", "```bash\n", "odsc data-catalog config --help\n", - "```\n", - "\n", - "\n", - "### Session Setup\n", - "\n", - "The notebook makes connections to the Data Catalog metastore and Object Storage. In the next cell, specify the bucket URI to act as the data warehouse. Use the `warehouse_uri` variable with the `oci://@/` format. Update the variable `metastore_id` with the OCID of the Data Catalog metastore." + "```" ] }, { "cell_type": "markdown", - "id": "367ba357", + "id": "e840f262", "metadata": {}, "source": [ - "\n", - "### 2.1. Policies\n", + "\n", + "### 2.2. Policies\n", "This section covers the creation of dynamic groups and policies needed to use the service.\n", "\n", "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)\n", @@ -192,19 +176,19 @@ }, { "cell_type": "markdown", - "id": "52bea9cf", + "id": "eeec1d4d", "metadata": {}, "source": [ - "\n", - "### 2.2. Authentication\n", + "\n", + "### 2.3. Authentication\n", "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook cluster.
\n", "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```." ] }, { "cell_type": "code", - "execution_count": 1, - "id": "ac079f4b", + "execution_count": null, + "id": "233ac5e8", "metadata": { "ExecuteTime": { "start_time": "2023-05-24T08:26:08.577504Z" @@ -217,23 +201,23 @@ "outputs": [], "source": [ "import ads\n", - "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"service_endpoint\": \"https://bi3jfhvilwl7gelzjbv3ovim2m.apigateway.us-ashburn-1.oci.customer-oci.com/20230101\"})" + "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"https://{api_gateway}/20230101\"})" ] }, { "cell_type": "markdown", - "id": "4df685c7", + "id": "429c36d6", "metadata": {}, "source": [ - "\n", - "### 2.3. Variables\n", + "\n", + "### 2.4. Variables\n", "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for offline feature store." ] }, { "cell_type": "code", - "execution_count": 2, - "id": "963224c0", + "execution_count": null, + "id": "80e80a24", "metadata": { "pycharm": { "is_executing": true @@ -243,84 +227,30 @@ "source": [ "import os\n", "\n", - "compartment_id = \"ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\"\n", - "metastore_id = \"ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\"" + "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n", + "metastore_id = \"\"" ] }, { "cell_type": "markdown", - "id": "e3087df9", + "id": "e9f96e28", "metadata": {}, "source": [ "\n", "# 3. 
Schema enforcement and schema evolution\n", - "By default the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html) and [deeque](https://github.com/awslabs/deequ) libraries. The joining functionality is heavily inspired by the APIs used by Pandas to merge, join or filter DataFrames. The APIs allow you to specify which features to select from which feature group, how to join them and which features to use in join conditions.\n", - "\n" + "By default the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html). Schema enforcement is a Delta Lake feature that prevents you from appending data with a different schema to a table. To change a table's current schema and accommodate data that changes over time, the schema evolution feature is used while performing an append or overwrite operation." ] }, { "cell_type": "code", - "execution_count": 3, - "id": "9f18611c", + "execution_count": null, + "id": "b1169e3a", "metadata": { "pycharm": { "is_executing": true } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/ads/model/deployment/model_deployment.py:48: DeprecationWarning: The `ads.model.deployment.model_deployment_properties` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeploymentInfrastructure` and `ModelDeploymentRuntime` classes in `ads.model.deployment` module for configuring model deployment.
Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", - " from .model_deployment_properties import ModelDeploymentProperties\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/ads/model/deployment/__init__.py:7: DeprecationWarning: The `ads.model.deployment.model_deployer` is deprecated in `oracle-ads 2.8.6` and will be removed in `oracle-ads 3.0`.Use `ModelDeployment` class in `ads.model.deployment` module for initializing and deploying model deployment. Check https://accelerated-data-science.readthedocs.io/en/latest/user_guide/model_registration/introduction.html\n", - " from .model_deployer import ModelDeployer\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/__init__.py:44: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " LooseVersion(pyarrow.__version__) >= LooseVersion(\"2.0.0\")\n", - "\n", - "WARNING:root:'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. 
pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/frame.py:62: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) >= LooseVersion(\"0.24\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/frame.py:81: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:85: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/indexes.py:191: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/missing/series.py:89: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) < LooseVersion(\"1.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/pandas/groupby.py:50: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pd.__version__) >= LooseVersion(\"1.3.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/fs/__init__.py:4: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", - "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", - " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/fs/opener/__init__.py:6: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs.opener')`.\n", - "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", - " __import__(\"pkg_resources\").declare_namespace(__name__) # type: ignore\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pkg_resources/__init__.py:2349: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('fs')`.\n", - "Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages\n", - " declare_namespace(parent)\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "import pandas as pd\n", "from ads.feature_store.feature_store import FeatureStore\n", @@ -336,10 +266,10 @@ }, { "cell_type": "markdown", - "id": "c72aef9f", + "id": "a9d0cad0", "metadata": {}, "source": [ - "\n", + "\n", "### 3.1. Exploration of data in feature store\n", "\n", "
\n", @@ -349,131 +279,10 @@ }, { "cell_type": "code", - "execution_count": 4, - "id": "d5a71a5f", + "execution_count": null, + "id": "8b59c7e4", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/tmp/ipykernel_2623/906484602.py:1: DtypeWarning: Columns (7,8) have mixed types. Specify dtype option on import or set low_memory=False.\n", - " flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", - "\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
YEARMONTHDAYDAY_OF_WEEKAIRLINEFLIGHT_NUMBERORIGIN_AIRPORTDESTINATION_AIRPORT
02015114AS98ANCSEA
12015114AA2336LAXPBI
22015114US840SFOCLT
32015114AA258LAXMIA
42015114AS135SEAANC
\n", - "
" - ], - "text/plain": [ - " YEAR MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT \\\n", - "0 2015 1 1 4 AS 98 ANC \n", - "1 2015 1 1 4 AA 2336 LAX \n", - "2 2015 1 1 4 US 840 SFO \n", - "3 2015 1 1 4 AA 258 LAX \n", - "4 2015 1 1 4 AS 135 SEA \n", - "\n", - " DESTINATION_AIRPORT \n", - "0 SEA \n", - "1 PBI \n", - "2 CLT \n", - "3 MIA \n", - "4 ANC " - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "flights_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/flights.csv\")[['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT']]\n", "flights_df = flights_df.head(100)\n", @@ -482,114 +291,14 @@ }, { "cell_type": "code", - "execution_count": 6, - "id": "5f26aa4e", + "execution_count": null, + "id": "6735e954", "metadata": { "pycharm": { "is_executing": true } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IATA_CODEAIRPORTCITYSTATELATITUDELONGITUDE
0ABELehigh Valley International AirportAllentownPA40.65236-75.44040
1ABIAbilene Regional AirportAbileneTX32.41132-99.68190
2ABQAlbuquerque International SunportAlbuquerqueNM35.04022-106.60919
3ABRAberdeen Regional AirportAberdeenSD45.44906-98.42183
4ABYSouthwest Georgia Regional AirportAlbanyGA31.53552-84.19447
\n", - "
" - ], - "text/plain": [ - " IATA_CODE AIRPORT CITY STATE LATITUDE \\\n", - "0 ABE Lehigh Valley International Airport Allentown PA 40.65236 \n", - "1 ABI Abilene Regional Airport Abilene TX 32.41132 \n", - "2 ABQ Albuquerque International Sunport Albuquerque NM 35.04022 \n", - "3 ABR Aberdeen Regional Airport Aberdeen SD 45.44906 \n", - "4 ABY Southwest Georgia Regional Airport Albany GA 31.53552 \n", - "\n", - " LONGITUDE \n", - "0 -75.44040 \n", - "1 -99.68190 \n", - "2 -106.60919 \n", - "3 -98.42183 \n", - "4 -84.19447 " - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE']\n", "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")[columns]\n", @@ -598,83 +307,14 @@ }, { "cell_type": "code", - "execution_count": 7, - "id": "fdab3e5c", + "execution_count": null, + "id": "363c818b", "metadata": { "pycharm": { "is_executing": true } }, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IATA_CODEAIRLINE
0UAUnited Air Lines Inc.
1AAAmerican Airlines Inc.
2USUS Airways Inc.
3F9Frontier Airlines Inc.
4B6JetBlue Airways
\n", - "
" - ], - "text/plain": [ - " IATA_CODE AIRLINE\n", - "0 UA United Air Lines Inc.\n", - "1 AA American Airlines Inc.\n", - "2 US US Airways Inc.\n", - "3 F9 Frontier Airlines Inc.\n", - "4 B6 JetBlue Airways" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "airlines_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airlines.csv\")\n", "airlines_df.head()" @@ -682,16 +322,16 @@ }, { "cell_type": "markdown", - "id": "9fd09cb0", + "id": "4c800a75", "metadata": {}, "source": [ - "\n", + "\n", "### 3.2. Create feature store logical entities" ] }, { "cell_type": "markdown", - "id": "2fce1574", + "id": "ab64f16f", "metadata": {}, "source": [ "#### 3.2.1 Feature Store\n", @@ -700,8 +340,8 @@ }, { "cell_type": "code", - "execution_count": 8, - "id": "4104e209", + "execution_count": null, + "id": "01c4dc79", "metadata": { "pycharm": { "is_executing": true @@ -720,7 +360,7 @@ }, { "cell_type": "markdown", - "id": "adeb7bb8", + "id": "d6c3e1bf", "metadata": {}, "source": [ "\n", @@ -731,34 +371,14 @@ }, { "cell_type": "code", - "execution_count": 9, - "id": "a6f2e337", + "execution_count": null, + "id": "35d70317", "metadata": { "pycharm": { "is_executing": true } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: featurestore\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", - " description: Data consisting of flights\n", - " displayName: flights details\n", - " id: EA128EDAE4380286A842064AF466A685\n", - " offlineConfig:\n", - " metastoreId: ocid1.datacatalogmetastore.oc1.iad.amaaaaaabiudgxyap7tizm4gscwz7amu7dixz7ml3mtesqzzwwg3urvvdgua\n", - "type: featureStore" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], 
"source": [ "feature_store = feature_store_resource.create()\n", "feature_store" @@ -766,7 +386,7 @@ }, { "cell_type": "markdown", - "id": "d28d15e1", + "id": "de92fc24", "metadata": {}, "source": [ "#### 3.2.2 Entity\n", @@ -775,29 +395,10 @@ }, { "cell_type": "code", - "execution_count": 10, - "id": "ee0f393e", + "execution_count": null, + "id": "39087c3a", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: entity\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", - " description: description for flight details\n", - " featureStoreId: EA128EDAE4380286A842064AF466A685\n", - " id: 55EB4FC9F3D8AEE40442046F7B7EE92C\n", - " name: Flight details schema evolution/enforcement\n", - "type: entity" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "entity = feature_store.create_entity(\n", " display_name=\"Flight details schema evolution/enforcement\",\n", @@ -808,7 +409,7 @@ }, { "cell_type": "markdown", - "id": "6998d51a", + "id": "33485b3e", "metadata": {}, "source": [ "\n", @@ -819,21 +420,10 @@ }, { "cell_type": "code", - "execution_count": 11, - "id": "5ca56127", + "execution_count": null, + "id": "13ff8e8c", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{\"meta\": {}, \"expectation_type\": \"expect_column_values_to_be_between\", \"kwargs\": {\"column\": \"LONGITUDE\", \"min_value\": -1.0, \"max_value\": 1.0}}" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", "\n", @@ -863,35 +453,10 @@ }, { "cell_type": "code", - "execution_count": 12, - "id": "bb60c0ad", + "execution_count": null, + "id": "66fae082", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting default log 
level to \"WARN\".\n", - "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", - "2023/07/25 10:07:54 NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:471: FutureWarning: iteritems is deprecated and will be removed in a future version. 
Use .items instead.\n", - " arrow_data = [[(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]\n", - "\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airports = (\n", " FeatureGroup()\n", @@ -910,487 +475,42 @@ }, { "cell_type": "code", - "execution_count": 13, - "id": "37159872", + "execution_count": null, + "id": "d966fc78", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", - " entityId: 55EB4FC9F3D8AEE40442046F7B7EE92C\n", - " expectationDetails:\n", - " createRuleDetails:\n", - " - arguments:\n", - " column: IATA_CODE\n", - " levelType: ERROR\n", - " name: Rule-0\n", - " ruleType: expect_column_values_to_not_be_null\n", - " - arguments:\n", - " column: LATITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-1\n", - " ruleType: expect_column_values_to_be_between\n", - " - arguments:\n", - " column: LONGITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-2\n", - " ruleType: expect_column_values_to_be_between\n", - " expectationType: LENIENT\n", - " name: test_airports_df\n", - " validationEngineType: GREAT_EXPECTATIONS\n", - " featureStoreId: EA128EDAE4380286A842064AF466A685\n", - " id: 26DE61A551F8BF29F132FF03B62B3E67\n", - " inputFeatureDetails:\n", - " - featureType: STRING\n", - " name: IATA_CODE\n", - " orderNumber: 1\n", - " - featureType: STRING\n", - " name: AIRPORT\n", - " orderNumber: 2\n", - " - featureType: STRING\n", - " name: CITY\n", - " orderNumber: 3\n", - " - featureType: STRING\n", - " name: STATE\n", - " orderNumber: 4\n", - " - featureType: DOUBLE\n", - " name: LATITUDE\n", - " orderNumber: 5\n", - " - featureType: DOUBLE\n", - " name: LONGITUDE\n", - " orderNumber: 6\n", - " isInferSchema: true\n", - " name: 
airport_feature_group\n", - " primaryKeys:\n", - " items:\n", - " - name: IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: featureGroup" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airports.create()" ] }, { "cell_type": "code", - "execution_count": 14, - "id": "7ac26507", + "execution_count": null, + "id": "e4bbefa2", "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "%3\n", - "\n", - "\n", - "EA128EDAE4380286A842064AF466A685\n", - "\n", - "flights details\n", - "Feature Store\n", - "EA128EDAE4380286A842064AF466A685\n", - "\n", - "\n", - "55EB4FC9F3D8AEE40442046F7B7EE92C\n", - "\n", - "Flight details schema evolution/enforcement\n", - "Entity\n", - "55EB4FC9F3D8AEE40442046F7B7EE92C\n", - "\n", - "\n", - "EA128EDAE4380286A842064AF466A685->55EB4FC9F3D8AEE40442046F7B7EE92C\n", - "\n", - "\n", - "\n", - "\n", - "26DE61A551F8BF29F132FF03B62B3E67\n", - "\n", - "airport_feature_group\n", - "Feature Group\n", - "26DE61A551F8BF29F132FF03B62B3E67\n", - "\n", - "\n", - "55EB4FC9F3D8AEE40442046F7B7EE92C->26DE61A551F8BF29F132FF03B62B3E67\n", - "\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "feature_group_airports.show()" ] }, { "cell_type": "code", - "execution_count": 15, - "id": "1a1ddd95", + "execution_count": null, + "id": "9f9519e6", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Hive Session ID = cdd7eb82-a9e8-4f1b-bdad-93400dab3a3a\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "676d7a993ba94ba2a8fe00292890547b", - 
"version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/16 [00:00 (0 + 2) / 2]\r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f629f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f6e970>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f6ef70>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f6ef30>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f521b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f52230>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f52670>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f52270>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f527f0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f52df0>), '6e3ac490990d92bca69c828fe3aff8ad': 
QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d43f526b0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f60bb0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43f607f0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d43f600f0>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated 
cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting 
list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated 
TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent 
items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from 
sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - 
"INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc 
from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, 
-80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.01567398119122254, 0.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 
0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airports.materialise(airports_df)" ] }, { "cell_type": "markdown", - "id": "a8b2d54e", + "id": "dff776cc", "metadata": {}, "source": [ "\n", @@ -1401,116 +521,10 @@ }, { "cell_type": "code", - "execution_count": 17, - "id": "f6d46a5e", + "execution_count": null, + "id": "1791d8f0", "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IATA_CODEAIRPORTCITYSTATELATITUDELONGITUDECOUNTRY
0ABELehigh Valley International AirportAllentownPA40.65236-75.44040USA
1ABIAbilene Regional AirportAbileneTX32.41132-99.68190USA
2ABQAlbuquerque International SunportAlbuquerqueNM35.04022-106.60919USA
3ABRAberdeen Regional AirportAberdeenSD45.44906-98.42183USA
4ABYSouthwest Georgia Regional AirportAlbanyGA31.53552-84.19447USA
\n", - "
" - ], - "text/plain": [ - " IATA_CODE AIRPORT CITY STATE LATITUDE \\\n", - "0 ABE Lehigh Valley International Airport Allentown PA 40.65236 \n", - "1 ABI Abilene Regional Airport Abilene TX 32.41132 \n", - "2 ABQ Albuquerque International Sunport Albuquerque NM 35.04022 \n", - "3 ABR Aberdeen Regional Airport Aberdeen SD 45.44906 \n", - "4 ABY Southwest Georgia Regional Airport Albany GA 31.53552 \n", - "\n", - " LONGITUDE COUNTRY \n", - "0 -75.44040 USA \n", - "1 -99.68190 USA \n", - "2 -106.60919 USA \n", - "3 -98.42183 USA \n", - "4 -84.19447 USA " - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "columns = ['IATA_CODE', 'AIRPORT', 'CITY', 'STATE', 'LATITUDE', 'LONGITUDE', 'COUNTRY']\n", "airports_df = pd.read_csv(\"https://objectstorage.us-ashburn-1.oraclecloud.com/p/hh2NOgFJbVSg4amcLM3G3hkTuHyBD-8aE_iCsuZKEvIav1Wlld-3zfCawG4ycQGN/n/ociodscdev/b/oci-feature-store/o/beta/data/flights/airports.csv\")[columns]\n", @@ -1519,117 +533,10 @@ }, { "cell_type": "code", - "execution_count": 18, - "id": "39722b5f", + "execution_count": null, + "id": "c6357225", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. 
Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:471: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.\n", - " arrow_data = [[(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]\n", - "\n" - ] - }, - { - "data": { - "text/plain": [ - "\n", - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", - " entityId: 55EB4FC9F3D8AEE40442046F7B7EE92C\n", - " expectationDetails:\n", - " createRuleDetails:\n", - " - arguments:\n", - " column: IATA_CODE\n", - " levelType: ERROR\n", - " name: Rule-0\n", - " ruleType: expect_column_values_to_not_be_null\n", - " - arguments:\n", - " column: LATITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-1\n", - " ruleType: expect_column_values_to_be_between\n", - " - arguments:\n", - " column: LONGITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-2\n", - " ruleType: expect_column_values_to_be_between\n", - " expectationType: LENIENT\n", - " name: test_airports_df\n", - " validationEngineType: GREAT_EXPECTATIONS\n", - " featureStoreId: EA128EDAE4380286A842064AF466A685\n", - " id: 26DE61A551F8BF29F132FF03B62B3E67\n", - " inputFeatureDetails:\n", - " - featureType: STRING\n", - " name: IATA_CODE\n", - " orderNumber: 1\n", - " - featureType: STRING\n", - " name: AIRPORT\n", - " orderNumber: 2\n", - " - featureType: STRING\n", - " name: CITY\n", - " orderNumber: 3\n", - " - featureType: STRING\n", - " name: STATE\n", - " orderNumber: 4\n", - " - featureType: DOUBLE\n", - " name: LATITUDE\n", - " orderNumber: 5\n", - " - featureType: DOUBLE\n", - " name: LONGITUDE\n", - " orderNumber: 6\n", - " - 
featureType: STRING\n", - " name: COUNTRY\n", - " orderNumber: 7\n", - " isInferSchema: true\n", - " jobId: 9e11aebd-3ab1-4da3-a6dd-aa90bd1be5f7\n", - " name: airport_feature_group\n", - " outputFeatureDetails:\n", - " items:\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: IATA_CODE\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: AIRPORT\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: CITY\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: STATE\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: DOUBLE\n", - " name: LATITUDE\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: DOUBLE\n", - " name: LONGITUDE\n", - " primaryKeys:\n", - " items:\n", - " - name: IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: featureGroup" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airports.with_schema_details_from_dataframe(airports_df)\n", "feature_group_airports.update()" @@ -1637,133 +544,17 @@ }, { "cell_type": "code", - "execution_count": 19, - "id": "3ad0d743", + "execution_count": null, + "id": "7dc15e14", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "09a87a20c8af48beaf230a58ee2b1609", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/16 [00:00 with error message: A schema mismatch detected when writing to the Delta table (Table ID: 020a3b36-917b-4fdc-890f-4fa27abdd809).\n", - "To enable schema migration using 
DataFrameWriter or DataStreamWriter, please set:\n", - "'.option(\"mergeSchema\", \"true\")'.\n", - "For other operations, set the session configuration\n", - "spark.databricks.delta.schema.autoMerge.enabled to \"true\". See the documentation\n", - "specific to the operation for details.\n", - "\n", - "Table schema:\n", - "root\n", - "-- IATA_CODE: string (nullable = true)\n", - "-- AIRPORT: string (nullable = true)\n", - "-- CITY: string (nullable = true)\n", - "-- STATE: string (nullable = true)\n", - "-- LATITUDE: double (nullable = true)\n", - "-- LONGITUDE: double (nullable = true)\n", - "\n", - "\n", - "Data schema:\n", - "root\n", - "-- IATA_CODE: string (nullable = true)\n", - "-- AIRPORT: string (nullable = true)\n", - "-- CITY: string (nullable = true)\n", - "-- STATE: string (nullable = true)\n", - "-- LATITUDE: double (nullable = true)\n", - "-- LONGITUDE: double (nullable = true)\n", - "-- COUNTRY: string (nullable = true)\n", - "\n", - " \n", - "To overwrite your schema or change partitioning, please set:\n", - "'.option(\"overwriteSchema\", \"true\")'.\n", - "\n", - "Note that the schema can't be overwritten when using\n", - "'replaceWhere'.\n", - " \n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤══════════════════════════════════════════════════════════════════════════════════════════════════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡\n", - "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Failed │ A schema mismatch detected when writing to the Delta table (Table ID: 020a3b36-917b-4fdc-890f-4fa27abdd809). 
│\n", - "│ │ │ │ To enable schema migration using DataFrameWriter or DataStreamWriter, please set: │\n", - "│ │ │ │ '.option(\"mergeSchema\", \"true\")'. │\n", - "│ │ │ │ For other operations, set the session configuration │\n", - "│ │ │ │ spark.databricks.delta.schema.autoMerge.enabled to \"true\". See the documentation │\n", - "│ │ │ │ specific to the operation for details. │\n", - "│ │ │ │ │\n", - "│ │ │ │ Table schema: │\n", - "│ │ │ │ root │\n", - "│ │ │ │ -- IATA_CODE: string (nullable = true) │\n", - "│ │ │ │ -- AIRPORT: string (nullable = true) │\n", - "│ │ │ │ -- CITY: string (nullable = true) │\n", - "│ │ │ │ -- STATE: string (nullable = true) │\n", - "│ │ │ │ -- LATITUDE: double (nullable = true) │\n", - "│ │ │ │ -- LONGITUDE: double (nullable = true) │\n", - "│ │ │ │ │\n", - "│ │ │ │ │\n", - "│ │ │ │ Data schema: │\n", - "│ │ │ │ root │\n", - "│ │ │ │ -- IATA_CODE: string (nullable = true) │\n", - "│ │ │ │ -- AIRPORT: string (nullable = true) │\n", - "│ │ │ │ -- CITY: string (nullable = true) │\n", - "│ │ │ │ -- STATE: string (nullable = true) │\n", - "│ │ │ │ -- LATITUDE: double (nullable = true) │\n", - "│ │ │ │ -- LONGITUDE: double (nullable = true) │\n", - "│ │ │ │ -- COUNTRY: string (nullable = true) │\n", - "│ │ │ │ │\n", - "│ │ │ │ │\n", - "│ │ │ │ To overwrite your schema or change partitioning, please set: │\n", - "│ │ │ │ '.option(\"overwriteSchema\", \"true\")'. │\n", - "│ │ │ │ │\n", - "│ │ │ │ Note that the schema can't be overwritten when using │\n", - "│ │ │ │ 'replaceWhere'. 
│\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧══════════════════════════════════════════════════════════════════════════════════════════════════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airports.materialise(airports_df)" ] }, { "cell_type": "markdown", - "id": "42495d16", + "id": "107c8b58", "metadata": {}, "source": [ "\n", @@ -1774,8 +565,8 @@ }, { "cell_type": "code", - "execution_count": 20, - "id": "8374d4c3", + "execution_count": null, + "id": "aeba1145", "metadata": {}, "outputs": [], "source": [ @@ -1785,313 +576,10 @@ }, { "cell_type": "code", - "execution_count": 21, - "id": "31e59f5b", + "execution_count": null, + "id": "42e74b33", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "a03758c89f9147d785de310b66f43c6c", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440f87b0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440f8cf0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440f8df0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440fef30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440fecf0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': 
FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440fe7f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440fe170>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440fe7b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440fed30>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d440fee70>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d4410b170>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d4410b370>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d4410b270>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d4410b570>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d4410b5b0>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - 
"INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - 
"INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 
29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 
51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: 
{'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - 
"INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.01567398119122254, 0.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - 
"INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - 
"INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", - 
"INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "feature_group_airports.materialise(\n", " input_dataframe=airports_df,\n", @@ -2101,112 +589,17 @@ }, { "cell_type": "code", - "execution_count": 22, - "id": "a4a9d4fb", + "execution_count": null, + "id": "f4e11f65", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "\n", - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa462hfhplpx652b32ix62xrdijppq2c7okwcqjlgrbknhgtj2kofa\n", - " entityId: 55EB4FC9F3D8AEE40442046F7B7EE92C\n", - " expectationDetails:\n", - " createRuleDetails:\n", - " - arguments:\n", - " column: IATA_CODE\n", - " levelType: ERROR\n", - " name: Rule-0\n", - " ruleType: expect_column_values_to_not_be_null\n", - " - arguments:\n", - " column: LATITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-1\n", - " ruleType: expect_column_values_to_be_between\n", - " - arguments:\n", - " column: LONGITUDE\n", - " max_value: 1.0\n", - " min_value: -1.0\n", - " levelType: ERROR\n", - " name: Rule-2\n", - " ruleType: expect_column_values_to_be_between\n", - " expectationType: LENIENT\n", - " name: test_airports_df\n", - " validationEngineType: GREAT_EXPECTATIONS\n", - " featureStoreId: EA128EDAE4380286A842064AF466A685\n", - " id: 26DE61A551F8BF29F132FF03B62B3E67\n", - " inputFeatureDetails:\n", - " - featureType: STRING\n", - " name: IATA_CODE\n", - " orderNumber: 1\n", - " - featureType: STRING\n", - " name: AIRPORT\n", 
- " orderNumber: 2\n", - " - featureType: STRING\n", - " name: CITY\n", - " orderNumber: 3\n", - " - featureType: STRING\n", - " name: STATE\n", - " orderNumber: 4\n", - " - featureType: DOUBLE\n", - " name: LATITUDE\n", - " orderNumber: 5\n", - " - featureType: DOUBLE\n", - " name: LONGITUDE\n", - " orderNumber: 6\n", - " - featureType: STRING\n", - " name: COUNTRY\n", - " orderNumber: 7\n", - " isInferSchema: true\n", - " jobId: 6e6a6d07-6a8f-4ea4-8508-264054f4dfb5\n", - " name: airport_feature_group\n", - " outputFeatureDetails:\n", - " items:\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: IATA_CODE\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: AIRPORT\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: CITY\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: STATE\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: DOUBLE\n", - " name: LATITUDE\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: DOUBLE\n", - " name: LONGITUDE\n", - " - featureGroupId: 26DE61A551F8BF29F132FF03B62B3E67\n", - " featureType: STRING\n", - " name: COUNTRY\n", - " primaryKeys:\n", - " items:\n", - " - name: IATA_CODE\n", - " statisticsConfig:\n", - " isEnabled: true\n", - "type: featureGroup" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airports" ] }, { "cell_type": "markdown", - "id": "83082aa3", + "id": "a30c68c3", "metadata": {}, "source": [ "\n", @@ -2215,7 +608,7 @@ }, { "cell_type": "markdown", - "id": "4a3ea8b7", + "id": "43eb6897", "metadata": {}, "source": [ "\n", @@ -2226,322 +619,10 @@ }, { "cell_type": "code", - "execution_count": 23, - "id": "5983a241", + "execution_count": null, + "id": "c6587f2e", "metadata": {}, - 
"outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "af963256ff4d4bf9946faaa2f0229975", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440e05f0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440f5230>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44045070>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44045a30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d440450f0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f629b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44045370>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43ea99b0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d46ce1670>), 
'6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d43f62ab0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d45082cb0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43ead3b0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d43ead730>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d43f595b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d43ead6f0>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: 
{'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: 
{'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of 
input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.606550000000006\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.009404388714733543, 0.0, 0.01567398119122257, 0.009404388714733543, 0.0031347962382445166, 0.006269592476489026, 0.025078369905956112, 0.01567398119122257, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.07210031347962381, 0.08777429467084635, 0.12225705329153613, 0.12852664576802508, 0.07836990595611282, 0.07523510971786829, 0.037617554858934255, 0.0, 0.0, 0.0, 0.01253918495297801, 
0.01567398119122254, 0.012539184952978122, 0.0, 0.00940438871473348, 0.0031347962382445305, 0.0, 0.0031347962382445305, 0.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [3, 0, 5, 3, 1, 2, 8, 5, 23, 25, 34, 23, 28, 39, 41, 25, 24, 12, 0, 0, 0, 4, 5, 4, 0, 3, 1, 0, 1, 0]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.56294, 'q2': 39.29761, 'q3': 43.16949}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.386920000000003\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, 
-141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.0, 0.003134796238244514, 0.003134796238244514, 0.009404388714733541, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.0031347962382445096, 0.006269592476489033, 0.006269592476489019, 0.00940438871473355, 0.006269592476489033, 0.0, 0.018808777429467072, 0.05642633228840126, 0.040752351097178674, 0.05015673981191224, 0.03448275862068967, 0.043887147335423204, 0.037617554858934144, 0.09090909090909094, 0.08463949843260188, 0.08777429467084641, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.00940438871473348, 0.006269592476489061]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, 
-122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [0, 1, 1, 3, 3, 5, 2, 1, 2, 2, 3, 2, 0, 6, 18, 13, 16, 11, 14, 12, 29, 27, 28, 32, 30, 26, 18, 9, 3, 2]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - 
"INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -110.94103, 'q2': -93.40307, 'q3': -82.55411}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct 
count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "from ads.feature_store.feature_group_job import IngestionMode\n", "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.APPEND)" @@ -2549,7 +630,7 @@ }, { "cell_type": "markdown", - "id": "443bb29e", + "id": "363557f5", "metadata": {}, "source": [ "\n", @@ -2559,322 +640,10 @@ }, { 
"cell_type": "code", - "execution_count": 24, - "id": "0946e237", + "execution_count": null, + "id": "a869935e", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "56d101b3aaf54a3c9b4d22624375673b", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44159a30>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d441301b0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d441305f0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44130830>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44130270>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d442177f0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44217cf0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44217770>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), 
'4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44217630>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d442176f0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44217ab0>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44217db0>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d44217030>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44141bb0>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44141eb0>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - 
"INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - 
"INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - 
"INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated 
cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: 13.48345\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 
0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 
71.28545\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated 
ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.01567398119122254, 0.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, 
-161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from 
sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "from ads.feature_store.feature_group_job import IngestionMode\n", "feature_group_airports.materialise(airports_df, 
ingestion_mode=IngestionMode.OVERWRITE)" @@ -2882,7 +651,7 @@ }, { "cell_type": "markdown", - "id": "818940f3", + "id": "320681ba", "metadata": {}, "source": [ "\n", @@ -2892,325 +661,10 @@ }, { "cell_type": "code", - "execution_count": 25, - "id": "f6cd567a", + "execution_count": null, + "id": "39016aea", "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "INFO:great_expectations.validator.validator:\t3 expectation(s) included in expectation_suite.\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "0845dd8e0c53455abd4f484ad7661c90", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Calculating Metrics: 0%| | 0/16 [00:00), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d442171f0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d4404bd30>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44111cf0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440f8670>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44111fb0>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44007230>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44038f70>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d440d9530>), 'c19e3960aa08a392d20aaa5da607d9ea': 
DescriptiveStatisticsSFC(total_count=319.0, mean=38.9812439184953, minimum=13.48345, maximum=71.28545, central_moments=[1.0, 8.909626780690911e-17, 74.01537930806269, 262.87069420949706, 26574.825385423774]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44159a70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d4404e170>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d44141230>), 'c19e3960aa08a392d20aaa5da607d9ea': DescriptiveStatisticsSFC(total_count=319.0, mean=-98.37896445141065, minimum=-176.64603, maximum=-64.79856, central_moments=[1.0, 0.0, 461.80848194502215, -11904.62460720004, 932401.3978279813]), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d44141a70>), '6e3ac490990d92bca69c828fe3aff8ad': QuantilesSFC(kll_sketch=<_datasketches.kll_doubles_sketch object at 0x7f9d44141630>)} sfc map\n", - "INFO:mlm_insights.core.sfcs:creating sfc from {'c5144335a509689fc50d13d03eebc9b1': FrequentItemsSFC(sketch=<_datasketches.frequent_strings_sketch object at 0x7f9d45cafc70>), '4cd1d3704778a196571a6c83581854cc': DistinctCountSFC(sketch=<_datasketches.hll_sketch object at 0x7f9d442037b0>)} sfc map\n", - "INFO:mlm_insights.core.sdcs:creating sdc from {} sdc map\n", - "INFO:mlm_insights.builder:Profile Generated Successfully\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, 
value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 0, 'percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 322.00025670253893 in Distinct count SFC, upper bound = 322.0163339340549, lower bound = 322.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 322\n", - 
"INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 14, 'percentage': 4.3478260869565215}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: []\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - 
"INFO:mlm_insights.core.sfcs:Calculated cardinality = 308.000234832572 in Distinct count SFC, upper bound = 308.01561305348736, lower bound = 308.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 308.000234832572\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='TX', estimate=24, lower_bound=24, upper_bound=24), FrequentItemEstimate(value='CA', estimate=22, lower_bound=22, upper_bound=22), FrequentItemEstimate(value='AK', estimate=19, lower_bound=19, upper_bound=19), FrequentItemEstimate(value='FL', estimate=17, lower_bound=17, upper_bound=17), FrequentItemEstimate(value='MI', estimate=15, lower_bound=15, upper_bound=15), FrequentItemEstimate(value='NY', estimate=14, lower_bound=14, upper_bound=14), FrequentItemEstimate(value='CO', estimate=10, lower_bound=10, upper_bound=10), FrequentItemEstimate(value='NC', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='MN', estimate=8, lower_bound=8, upper_bound=8), FrequentItemEstimate(value='WI', estimate=8, lower_bound=8, upper_bound=8)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta 
data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 268, 'percentage': 83.22981366459628}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['TX']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 54.00000710785499 in Distinct count SFC, upper bound = 54.00270328774326, lower bound = 54.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 54.00000710785499\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: 0.41281856359758584\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 8.603219124726667\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min 
metric, value: 13.48345\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 9.529050000000005\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 57.802\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'density': [0.003134796238244514, 0.0, 0.015673981191222573, 0.01567398119122257, 0.0031347962382445166, 0.0, 
0.025078369905956105, 0.021943573667711602, 0.07210031347962384, 0.07836990595611285, 0.10658307210031348, 0.0658307210031348, 0.09404388714733536, 0.11598746081504707, 0.13479623824451414, 0.07836990595611282, 0.06896551724137934, 0.037617554858934144, 0.0, 0.006269592476489061, 0.0, 0.01253918495297801, 0.01567398119122254, 0.012539184952978122, 0.0, 0.0031347962382445305, 0.0031347962382444194, 0.0, 0.0031347962382445305, 0.006269592476489061]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 74.01537930806269\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution metric, value: {'bins': [13.48345, 15.476622413793104, 17.469794827586206, 19.46296724137931, 21.456139655172414, 23.44931206896552, 25.442484482758623, 27.435656896551723, 29.428829310344828, 31.422001724137928, 33.41517413793103, 35.40834655172414, 37.40151896551724, 39.394691379310345, 41.38786379310345, 43.38103620689655, 45.37420862068966, 47.367381034482754, 49.36055344827586, 51.35372586206896, 53.34689827586207, 55.34007068965517, 57.333243103448275, 59.32641551724138, 61.319587931034484, 63.31276034482759, 65.3059327586207, 67.29910517241379, 69.2922775862069, 71.28545], 'frequency': [1, 0, 5, 5, 1, 0, 8, 7, 23, 25, 34, 21, 30, 37, 43, 25, 22, 12, 0, 2, 0, 4, 5, 4, 0, 1, 1, 0, 1, 2]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: 71.28545\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: 12435.01681\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': 33.64044, 'q2': 39.29761, 'q3': 43.16949}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: 38.9812439184953\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.850946460274213\n", - 
"INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Skewness metric, value: -1.199562407919743\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Creating StandardDeviation metric, value: 21.489729685247838\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Min metric, value: -176.64603\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsConstantFeature metric, value: False\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.metrics:Calculated IQR metric, value: 28.225759999999994\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Range metric, value: 111.84747\n", - "INFO:mlm_insights.core.sfcs:getting 
SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated ProbabilityDistribution metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'density': [0.006269592476489028, 0.003134796238244515, 0.003134796238244513, 0.003134796238244513, 0.009404388714733543, 0.01567398119122257, 0.006269592476489033, 0.009404388714733543, 0.006269592476489019, 0.0, 0.00940438871473355, 0.012539184952978052, 0.0, 0.012539184952978052, 0.05642633228840126, 0.040752351097178674, 0.05642633228840124, 0.028213166144200663, 0.05015673981191221, 0.03134796238244514, 0.09090909090909094, 0.09090909090909094, 0.08150470219435735, 0.10031347962382442, 0.09404388714733547, 0.08150470219435735, 0.056426332288401215, 0.028213166144200663, 0.01567398119122254, 0.0]}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Variance metric, value: 461.80848194502215\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 0, 'integral_type_count': 0, 'fractional_type_count': 319, 'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated FrequencyDistribution 
metric, value: {'bins': [-176.64603, -172.78922068965517, -168.93241137931034, -165.0756020689655, -161.21879275862068, -157.36198344827585, -153.50517413793102, -149.6483648275862, -145.79155551724136, -141.93474620689653, -138.0779368965517, -134.22112758620688, -130.36431827586205, -126.50750896551723, -122.65069965517242, -118.79389034482759, -114.93708103448276, -111.08027172413793, -107.2234624137931, -103.36665310344827, -99.50984379310344, -95.65303448275861, -91.79622517241378, -87.93941586206896, -84.08260655172413, -80.2257972413793, -76.36898793103447, -72.51217862068965, -68.65536931034482, -64.79856], 'frequency': [2, 1, 1, 1, 3, 5, 2, 3, 2, 0, 3, 4, 0, 4, 18, 13, 18, 9, 16, 10, 29, 29, 26, 32, 30, 26, 18, 9, 5, 0]}\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 3.0, 'missing_count_percentage': 0.9316770186335404}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Max metric, value: -64.79856\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 319.0002519341608 in Distinct count SFC, upper bound = 319.01617937768685, lower bound = 319.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 319\n", - "INFO:mlm_insights.core.metrics:Calculated Sum metric, value: -31382.88966\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated IsQuasiConstantFeature 
metric, value: True\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.25\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.5\n", - "INFO:mlm_insights.core.sfcs:getting quantiles from sketch for rank 0.75\n", - "INFO:mlm_insights.core.metrics:Calculated Quartiles metric, value: {'q1': -111.11764, 'q2': -93.66068, 'q3': -82.89188}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Mean metric, value: -98.37896445141065\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Kurtosis metric, value: 1.3719894513293207\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.metrics:Calculated Count metric, value: {'total_count': 322.0, 'missing_count': 0.0, 'missing_count_percentage': 0.0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TopKFrequentElements metric, value: [FrequentItemEstimate(value='USA', estimate=322, lower_bound=322, upper_bound=322)]\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 10 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated TypeMetric metric, value: {'string_type_count': 322, 'integral_type_count': 0, 'fractional_type_count': 0, 
'boolean_type_count': 0}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.metrics:Calculated DuplicateCount metric, value: {'count': 321, 'percentage': 99.68944099378882}\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting top 2 frequent items\n", - "INFO:mlm_insights.core.sfcs:Getting list of all frequent items\n", - "INFO:mlm_insights.core.metrics:Calculated Mode metric, value: ['USA']\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:getting SFCMetaData(klass=, config={}) sfc from sfc meta data\n", - "INFO:mlm_insights.core.sfcs:Getting total count of input data\n", - "INFO:mlm_insights.core.sfcs:Calculated cardinality = 1.0 in Distinct count SFC, upper bound = 1.000049929250618, lower bound = 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated DistinctCount metric, value: 1.0\n", - "INFO:mlm_insights.core.metrics:Calculated RowCount metric, value: 322.0\n", - "INFO:ads.feature_store.common.utils.utility:Ingestion Summary \n", - "╒══════════════════════════════════╤═══════════════╤════════════════════╤═════════════════╕\n", - "│ entity_id │ entity_type │ ingestion_status │ error_details │\n", - "╞══════════════════════════════════╪═══════════════╪════════════════════╪═════════════════╡\n", - "│ 26DE61A551F8BF29F132FF03B62B3E67 │ FEATURE_GROUP │ Succeeded │ None │\n", - "╘══════════════════════════════════╧═══════════════╧════════════════════╧═════════════════╛\n" - ] - } - ], + "outputs": [], "source": [ "from 
ads.feature_store.feature_group_job import IngestionMode\n", "feature_group_airports.materialise(airports_df, ingestion_mode=IngestionMode.UPSERT)" @@ -3218,7 +672,7 @@ }, { "cell_type": "markdown", - "id": "edad9b57", + "id": "61d6d851", "metadata": {}, "source": [ "\n", @@ -3228,285 +682,57 @@ }, { "cell_type": "code", - "execution_count": 26, - "id": "3e909d02", + "execution_count": null, + "id": "ecfb6075", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:57: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/types.py:63: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pa.__version__) < LooseVersion(\"2.0.0\"):\n", - "\n", - "WARNING:py.warnings:/home/datascience/conda/fspyspark32_p38_cpu_v1/lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.\n", - " if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):\n", - "\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
versiontimestampuserIduserNameoperationoperationParametersjobnotebookclusterIdreadVersionisolationLevelisBlindAppendoperationMetricsuserMetadataengineInfo
042023-07-25 10:11:15NoneNoneMERGE{'predicate': '(target_delta_table.IATA_CODE = source_delta_table.IATA_CODE)', 'matchedPredicates': '[{\"actionType\":\"update\"}]', 'notMatchedPredicates': '[{\"actionType\":\"insert\"}]'}NoneNoneNone3.0SerializableFalse{'numTargetRowsCopied': '0', 'numTargetRowsDeleted': '0', 'numTargetFilesAdded': '1', 'executionTimeMs': '5340', 'numTargetRowsInserted': '0', 'scanTimeMs': '2694', 'numTargetRowsUpdated': '322', 'numOutputRows': '322', 'numTargetChangeFilesAdded': '0', 'numSourceRows': '322', 'numTargetFilesRemoved': '2', 'rewriteTimeMs': '2443'}NoneApache-Spark/3.2.1 Delta-Lake/2.0.1
132023-07-25 10:10:51NoneNoneCREATE OR REPLACE TABLE AS SELECT{'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'}NoneNoneNone2.0SerializableFalse{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'}NoneApache-Spark/3.2.1 Delta-Lake/2.0.1
222023-07-25 10:10:28NoneNoneWRITE{'mode': 'Append', 'partitionBy': '[]'}NoneNoneNone1.0SerializableTrue{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'}NoneApache-Spark/3.2.1 Delta-Lake/2.0.1
312023-07-25 10:10:13NoneNoneCREATE OR REPLACE TABLE AS SELECT{'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'}NoneNoneNone0.0SerializableFalse{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'}NoneApache-Spark/3.2.1 Delta-Lake/2.0.1
402023-07-25 10:09:13NoneNoneCREATE OR REPLACE TABLE AS SELECT{'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'}NoneNoneNoneNaNSerializableFalse{'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20174'}NoneApache-Spark/3.2.1 Delta-Lake/2.0.1
\n", - "
" - ], - "text/plain": [ - " version timestamp userId userName \\\n", - "0 4 2023-07-25 10:11:15 None None \n", - "1 3 2023-07-25 10:10:51 None None \n", - "2 2 2023-07-25 10:10:28 None None \n", - "3 1 2023-07-25 10:10:13 None None \n", - "4 0 2023-07-25 10:09:13 None None \n", - "\n", - " operation \\\n", - "0 MERGE \n", - "1 CREATE OR REPLACE TABLE AS SELECT \n", - "2 WRITE \n", - "3 CREATE OR REPLACE TABLE AS SELECT \n", - "4 CREATE OR REPLACE TABLE AS SELECT \n", - "\n", - " operationParameters \\\n", - "0 {'predicate': '(target_delta_table.IATA_CODE = source_delta_table.IATA_CODE)', 'matchedPredicates': '[{\"actionType\":\"update\"}]', 'notMatchedPredicates': '[{\"actionType\":\"insert\"}]'} \n", - "1 {'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'} \n", - "2 {'mode': 'Append', 'partitionBy': '[]'} \n", - "3 {'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'} \n", - "4 {'isManaged': 'true', 'description': None, 'partitionBy': '[]', 'properties': '{}'} \n", - "\n", - " job notebook clusterId readVersion isolationLevel isBlindAppend \\\n", - "0 None None None 3.0 Serializable False \n", - "1 None None None 2.0 Serializable False \n", - "2 None None None 1.0 Serializable True \n", - "3 None None None 0.0 Serializable False \n", - "4 None None None NaN Serializable False \n", - "\n", - " operationMetrics \\\n", - "0 {'numTargetRowsCopied': '0', 'numTargetRowsDeleted': '0', 'numTargetFilesAdded': '1', 'executionTimeMs': '5340', 'numTargetRowsInserted': '0', 'scanTimeMs': '2694', 'numTargetRowsUpdated': '322', 'numOutputRows': '322', 'numTargetChangeFilesAdded': '0', 'numSourceRows': '322', 'numTargetFilesRemoved': '2', 'rewriteTimeMs': '2443'} \n", - "1 {'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'} \n", - "2 {'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'} \n", - "3 {'numFiles': '2', 'numOutputRows': '322', 'numOutputBytes': '20732'} \n", - "4 {'numFiles': 
'2', 'numOutputRows': '322', 'numOutputBytes': '20174'} \n", - "\n", - " userMetadata engineInfo \n", - "0 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", - "1 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", - "2 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", - "3 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 \n", - "4 None Apache-Spark/3.2.1 Delta-Lake/2.0.1 " - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "feature_group_airports.history().toPandas()" ] }, { "cell_type": "markdown", - "id": "dd5de81e", + "id": "eb8e49ff", "metadata": {}, "source": [ "\n", - "### 3.6. Preview\n", - "\n", - "You can call the ``preview()`` method of the FeatureGroup instance to preview the feature group.\n", + "### 3.7. as_of\n", "\n", - "The ``.preview()`` method takes the following optional parameter:\n", + "You can call the ``as_of()`` method of the FeatureGroup instance to get time-traveled data as of a specified point in time.\n", + "The ``.as_of()`` method takes the following optional parameters:\n", "\n", - "- timestamp: date-time. Commit timestamp for feature group\n", - "- version_number: int. Version number for feature group\n", - "- row_count: int. Defaults to 10. Total number of row to return" + "- commit_timestamp: date-time. Commit timestamp for feature group\n", + "- version_number: int. 
Version number for feature group" ] }, { "cell_type": "code", - "execution_count": 27, - "id": "d706e9da", + "execution_count": null, + "id": "a559b6ed", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - " \r" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+--------------------+-------------+-----+--------+----------+-------+\n", - "|IATA_CODE| AIRPORT| CITY|STATE|LATITUDE| LONGITUDE|COUNTRY|\n", - "+---------+--------------------+-------------+-----+--------+----------+-------+\n", - "| ABE|Lehigh Valley Int...| Allentown| PA|40.65236| -75.4404| USA|\n", - "| ABI|Abilene Regional ...| Abilene| TX|32.41132| -99.6819| USA|\n", - "| ABQ|Albuquerque Inter...| Albuquerque| NM|35.04022|-106.60919| USA|\n", - "| ABR|Aberdeen Regional...| Aberdeen| SD|45.44906| -98.42183| USA|\n", - "| ABY|Southwest Georgia...| Albany| GA|31.53552| -84.19447| USA|\n", - "| ACK|Nantucket Memoria...| Nantucket| MA|41.25305| -70.06018| USA|\n", - "| ACT|Waco Regional Air...| Waco| TX|31.61129| -97.23052| USA|\n", - "| ACV| Arcata Airport|Arcata/Eureka| CA|40.97812|-124.10862| USA|\n", - "| ACY|Atlantic City Int...|Atlantic City| NJ|39.45758| -74.57717| USA|\n", - "| ADK| Adak Airport| Adak| AK|51.87796|-176.64603| USA|\n", - "+---------+--------------------+-------------+-----+--------+----------+-------+\n", - "\n" - ] - } - ], + "outputs": [], "source": [ - "feature_group_airports.preview().show()" + "feature_group_airports.as_of(version_number = 0).show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8c701548", + "metadata": {}, + "outputs": [], + "source": [ + "feature_group_airports.as_of(version_number = 1).show()" ] }, { "cell_type": "markdown", - "id": "9881408a", + "id": "0abb3f0e", "metadata": {}, "source": [ - "\n", + "\n", "# References\n", - "\n", + "- [Feature Store 
Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n", "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", @@ -3516,7 +742,7 @@ { "cell_type": "code", "execution_count": null, - "id": "972dfb03", + "id": "965b015a", "metadata": {}, "outputs": [], "source": [] diff --git a/notebook_examples/feature_store_spark_magic.ipynb b/notebook_examples/feature_store_spark_magic.ipynb index d4d19b6b..748c657d 100644 --- a/notebook_examples/feature_store_spark_magic.ipynb +++ b/notebook_examples/feature_store_spark_magic.ipynb @@ -1,8 +1,22 @@ { "cells": [ + { + "cell_type": "raw", + "id": "5c01b54f", + "metadata": {}, + "source": [ + "qweews@notebook{feature_store-querying.ipynb,\n", + " title: Data Flow Studio : Big Data Operations in Feature Store,\n", + " summary: Run Feature Store on interactive Spark workloads on a long lasting Data Flow Cluster.,\n", + " developed_on: fspyspark32_p38_cpu_v1,\n", + " keywords: feature store, querying,\n", + " license: Universal Permissive License v 1.0\n", + "}" + ] + }, { "cell_type": "markdown", - "id": "f10693dc", + "id": "8df0ae4c", "metadata": { "pycharm": { "name": "#%% md\n" @@ -21,50 +35,57 @@ "---\n", "# Overview:\n", "\n", - "This notebook demonstrates how to run interactive Spark workloads on a long lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** is used for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. 
It includes a set of magic commands for interactively running Spark code.\n", + "This notebook demonstrates how to run Feature Store on interactive Spark workloads on a long lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** is used for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. It includes a set of magic commands for interactively running Spark code.\n", "\n", "\n", "\n", "## Contents:\n", "\n", - "- 1. Introduction\n", - "- 1. Pre-requisites\n", - " - 2.1 Policies\n", - " - 2.2 Prerequisites Helpers\n", - " - 2.3 Authentication\n", - " - 2.4 Variables\n", + "- 1. Introduction\n", + "- 2. Pre-requisites\n", + " - 2.1 Policies\n", + " - 2.2 Helpers\n", + " - 2.3 Authentication\n", + " - 2.4 Variables\n", "- 3. Dataflow Magic\n", " - 3.1. Load extension\n", - " - 3.2. Load feature groups\n", + " - 3.2. Create DataFlow Session\n", " - 3.3. Data exploration\n", " - 3.4. Creation of logical entities of feature group\n", " - 3.4.1 Creation of feature store\n", " - 3.4.2 Creation of entity\n", " - 3.4.3 Creation of feature group\n", - " - 3.4.4 Materialisation of feature group\n", + " - 3.4.4 Materialisation of feature group\n", " - 3.4.5 Querying of feature group\n", - "- 4. References\n", + "- 4. 
References\n", "\n", "---\n", "\n", "\n", - "Compatible conda pack: [PySpark 3.2 and Data Flow](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n", - "\n", - "\n", - "\n", - "---" + "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8463c70", + "metadata": {}, + "outputs": [], + "source": [ + "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", + "!pip install --pre --no-deps oracle-ads==2.9.0rc0" ] }, { "cell_type": "markdown", - "id": "f616fcc9", + "id": "496a7e98", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ - "\n", + "\n", "# 1. Introduction\n", "\n", "Oracle feature store is a stack based solution that is deployed in the customer enclave using OCI resource manager. Customer can stand up the service with infrastructure in their own tenancy. The service consists of API which are deployed in customer tenancy using resource manager.\n", @@ -89,35 +110,37 @@ }, { "cell_type": "markdown", - "id": "ca672c4d", + "id": "22bcec86", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ - "\n", - "# 2. Pre-requisites \n", + "\n", + "# 2. Pre-requisites to Running this Notebook\n", "\n", "Data Flow Sessions are accessible through the following conda environment: \n", "\n", - "* **PySpark 3.2 and Feature Store (pyspark_3_v1)**" + "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", + "\n", + "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. 
You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore ID of the Hive metastore is tied to the feature store construct of the feature store service." ] }, { "cell_type": "markdown", - "id": "33daeebe", + "id": "28fd82db", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ - "\n", + "\n", "## 2.1. Policies\n", "This section covers the creation of dynamic groups and policies needed to use the service.\n", "\n", - "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm/)\n", + "* [Data Flow Policies](https://docs.oracle.com/iaas/data-flow/using/policies.htm)\n", "* [Getting Started with Data Flow](https://docs.oracle.com/iaas/data-flow/using/dfs_getting_started.htm)\n", "* [About Data Science Policies](https://docs.oracle.com/iaas/data-science/using/policies.htm)\n", "* [Data Catalog Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm)" @@ -125,7 +148,7 @@ }, { "cell_type": "markdown", - "id": "97ed352c", + "id": "d92fd7df", "metadata": { "pycharm": { "name": "#%% md\n" @@ -139,8 +162,8 @@ }, { "cell_type": "code", - "execution_count": 4, - "id": "6a5e9194", + "execution_count": null, + "id": "3c535a71", "metadata": { "pycharm": { "name": "#%%\n" @@ -158,14 +181,14 @@ }, { "cell_type": "markdown", - "id": "9c0484c6", + "id": "4d699131", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ - "\n", + "\n", "## 2.3. Authentication\n", "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the Data Flow Session Spark cluster.
\n", "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. " @@ -173,8 +196,8 @@ }, { "cell_type": "code", - "execution_count": 5, - "id": "1e6f441d", + "execution_count": null, + "id": "0ed15b93", "metadata": { "pycharm": { "name": "#%%\n" @@ -189,24 +212,24 @@ }, { "cell_type": "markdown", - "id": "86735a35", + "id": "44d0c3f1", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ - "\n", + "\n", "## 2.4. Variables\n", - "To run this notebook, you must provide some information about your tenancy configuration. To connect to the HIVE metastore, replace `` with the OCID for the HIVE metastore. Connecting to the metastore is optional. \n", + "To run this notebook, you must provide some information about your tenancy configuration. To connect to the HIVE metastore, replace `` with the OCID for the HIVE metastore.\n", "\n", "To create and run a Data Flow session, you must specify a ``, ``, bucket `` and `` for storing logs. These resources must be in the same compartment." 
] }, { "cell_type": "code", - "execution_count": 6, - "id": "276d1aec", + "execution_count": null, + "id": "c3ab9476", "metadata": { "pycharm": { "name": "#%%\n" @@ -214,16 +237,17 @@ }, "outputs": [], "source": [ - "compartment_id = \"\"\n", + "import os\n", + "compartment_id = os.environ.get(\"NB_SESSION_COMPARTMENT_OCID\")\n", "metastore_id = \"\"\n", "logs_bucket_uri = \"\"\n", "\n", - "custom_conda_environment_uri = \"oci://service-conda-packs-fs@bigdatadatasciencelarge/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda\"" + "custom_conda_environment_uri = \"oci://service-conda-packs@id19sfcrra6z/service_pack/cpu/PySpark_3.2_and_Feature_Store/1.0/fspyspark32_p38_cpu_v1#conda\"" ] }, { "cell_type": "markdown", - "id": "3fbc6c00", + "id": "835fa366", "metadata": { "pycharm": { "name": "#%% md\n" @@ -243,7 +267,7 @@ }, { "cell_type": "markdown", - "id": "591d4492", + "id": "1b977b2b", "metadata": { "pycharm": { "name": "#%% md\n" @@ -258,8 +282,8 @@ }, { "cell_type": "code", - "execution_count": 7, - "id": "a6e0890f", + "execution_count": null, + "id": "da895c49", "metadata": { "pycharm": { "name": "#%%\n" @@ -272,7 +296,7 @@ }, { "cell_type": "markdown", - "id": "b39ac865", + "id": "f48aa78c", "metadata": { "pycharm": { "name": "#%% md\n" @@ -286,79 +310,14 @@ }, { "cell_type": "code", - "execution_count": 8, - "id": "23c6a9e2", + "execution_count": null, + "id": "79775a26", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Setting up the Cluster..\n" - ] - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "56a9bba76fb7424ea6a7bc207a085508", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": 
"stream", - "text": [ - "Cluster is ready..\n", - "Starting Spark application..\n" - ] - }, - { - "data": { - "text/html": [ - "\n", - "
Session IDKindStateCurrent session
ocid1.dataflowapplication.oc1.iad.anuwcljsnif7xwia5uvy54rp5ybm2u2va6sg2azmpmtsw4i7s2wpqy3thj3apysparkIN_PROGRESSDataflow Run
" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "SparkSession available as 'spark'.\n", - "SparkContext available as 'sc'.\n" - ] - } - ], + "outputs": [], "source": [ "command = prepare_command(\n", " {\n", @@ -386,29 +345,14 @@ }, { "cell_type": "code", - "execution_count": 9, - "id": "53a7b300", + "execution_count": null, + "id": "a00cb706", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "%%spark\n", "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", @@ -424,7 +368,7 @@ "import os\n", "\n", "# Set the Authentications for the feature store operations\n", - "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"service_endpoint\": \"https://pac7vnpvfa2xkagazweggatqwy.apigateway.us-ashburn-1.oci.customer-oci.com/20230101\"})\n", + "ads.set_auth(auth=\"resource_principal\", client_kwargs={\"fs_service_endpoint\": \"https://{api_gateway}/20230101\"})\n", "\n", "# Variables\n", "compartment_id = \"\"\n", @@ -433,7 +377,7 @@ }, { "cell_type": "markdown", - "id": "6824f08f", + "id": "f43808d1", "metadata": { "pycharm": { "name": "#%% md\n" @@ -446,60 +390,14 @@ }, { "cell_type": "code", - "execution_count": 10, - "id": "eabdb503", + 
"execution_count": null, + "id": "90465a30", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-------------------+\n", - "|vendor_id| pickup_at| dropoff_at|\n", - "+---------+-------------------+-------------------+\n", - "| CMT|2011-01-29 02:38:35|2011-01-29 02:47:07|\n", - "| CMT|2011-01-28 10:38:19|2011-01-28 10:42:18|\n", - "| CMT|2011-01-28 23:49:58|2011-01-28 23:57:44|\n", - "| CMT|2011-01-28 23:52:09|2011-01-28 23:59:21|\n", - "| CMT|2011-01-28 10:34:39|2011-01-28 11:25:50|\n", - "| CMT|2011-01-28 23:50:00|2011-01-28 23:58:11|\n", - "| CMT|2011-01-29 02:38:48|2011-01-29 02:50:37|\n", - "| CMT|2011-01-29 02:41:16|2011-01-29 02:45:45|\n", - "| CMT|2011-01-28 23:50:51|2011-01-29 00:07:55|\n", - "| CMT|2011-01-29 02:41:34|2011-01-29 03:08:14|\n", - "| CMT|2011-01-28 23:50:22|2011-01-29 00:03:23|\n", - "| CMT|2011-01-29 02:40:30|2011-01-29 02:43:08|\n", - "| CMT|2011-01-29 02:42:47|2011-01-29 02:50:31|\n", - "| CMT|2011-01-28 23:51:10|2011-01-29 00:03:19|\n", - "| CMT|2011-01-28 05:07:16|2011-01-28 05:12:25|\n", - "| CMT|2011-01-29 02:42:31|2011-01-29 02:55:56|\n", - "| CMT|2011-01-28 23:51:01|2011-01-28 23:59:06|\n", - "| CMT|2011-01-29 02:39:23|2011-01-29 02:59:31|\n", - "| CMT|2011-01-29 02:41:18|2011-01-29 02:50:43|\n", - "| CMT|2011-01-28 10:30:44|2011-01-28 10:48:05|\n", - "+---------+-------------------+-------------------+\n", - "only showing top 20 rows" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "df_nyc_tlc = 
spark.read.parquet(\"oci://hosted-ds-datasets@bigdatadatasciencelarge/nyc_tlc/201[1,2,3,4,5,6,7,8]/**/data.parquet\", header=False, inferSchema=True)\n", @@ -510,7 +408,7 @@ }, { "cell_type": "markdown", - "id": "d5e06db4", + "id": "47927e0f", "metadata": { "pycharm": { "name": "#%% md\n" @@ -523,7 +421,7 @@ }, { "cell_type": "markdown", - "id": "c8e0ce2e", + "id": "8a5cd5b9", "metadata": { "pycharm": { "name": "#%% md\n" @@ -537,45 +435,14 @@ }, { "cell_type": "code", - "execution_count": 11, - "id": "7228e930", + "execution_count": null, + "id": "26067de7", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "kind: featurestore\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " description: Feature Store Description\n", - " displayName: FeatureStore\n", - " id: 8893420628AB925DBEF259F660862F31\n", - " offlineConfig:\n", - " metastoreId: ocid1.datacatalogmetastore.oc1.iad.amaaaaaanif7xwiaavhd2liaebamr3tbjzio3uw2lxuteoa5ejsfvhqufbsa\n", - "type: featureStore" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "feature_store_resource = FeatureStore(). 
\\\n", @@ -590,7 +457,7 @@ }, { "cell_type": "markdown", - "id": "a805da11", + "id": "67bafdd2", "metadata": { "pycharm": { "name": "#%% md\n" @@ -604,43 +471,14 @@ }, { "cell_type": "code", - "execution_count": 12, - "id": "84f611d7", + "execution_count": null, + "id": "657d3fe4", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "kind: entity\n", - "spec:\n", - " compartmentId: ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " featureStoreId: 8893420628AB925DBEF259F660862F31\n", - " id: 5748B756C5CEE21176FCCDFDB64FA08F\n", - " name: entity_resource-sticky-salmon-2023-07-14-05:46.01\n", - "type: entity" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "entity = feature_store.create_entity()\n", @@ -649,7 +487,7 @@ }, { "cell_type": "markdown", - "id": "4ccacb09", + "id": "47440cde", "metadata": { "pycharm": { "name": "#%% md\n" @@ -663,70 +501,14 @@ }, { "cell_type": "code", - "execution_count": 13, - "id": "d58b0569", + "execution_count": null, + "id": "690aa9f7", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "kind: FeatureGroup\n", - "spec:\n", - " compartmentId: 
ocid1.tenancy.oc1..aaaaaaaa25c5a2zpfki3wo4ofza5l72aehvwkjbuavpnzqtmr4nigdgzi57a\n", - " entityId: 5748B756C5CEE21176FCCDFDB64FA08F\n", - " expectationDetails:\n", - " createRuleDetails:\n", - " - arguments:\n", - " column: vendor_id\n", - " levelType: ERROR\n", - " name: Rule-0\n", - " ruleType: EXPECT_COLUMN_VALUES_TO_NOT_BE_NULL\n", - " expectationType: LENIENT\n", - " name: feature_definition\n", - " validationEngineType: GREAT_EXPECTATIONS\n", - " featureStoreId: 8893420628AB925DBEF259F660862F31\n", - " id: 6BAC94626CABC8944E7C29F5D9C8FC5E\n", - " inputFeatureDetails:\n", - " - featureType: STRING\n", - " name: vendor_id\n", - " orderNumber: 1\n", - " - featureType: TIMESTAMP\n", - " name: pickup_at\n", - " orderNumber: 2\n", - " - featureType: TIMESTAMP\n", - " name: dropoff_at\n", - " orderNumber: 3\n", - " isInferSchema: false\n", - " name: feature_group_big_data\n", - " primaryKeys:\n", - " items:\n", - " - name: vendor_id\n", - " statisticsConfig:\n", - " isEnabled: false\n", - "type: featureGroup" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "\n", @@ -755,7 +537,7 @@ }, { "cell_type": "markdown", - "id": "76f62d36", + "id": "55482958", "metadata": { "pycharm": { "name": "#%% md\n" @@ -768,36 +550,14 @@ }, { "cell_type": "code", - "execution_count": 14, - "id": "807e843c", + "execution_count": null, + "id": "2896b65c", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Calculating Metrics: 100%|##########| 8/8 [01:04<00:00, 8.12s/it]" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "import pandas as pd\n", @@ -809,7 +569,7 @@ }, { 
"cell_type": "markdown", - "id": "c3fe60de", + "id": "d7dee26d", "metadata": { "pycharm": { "name": "#%% md\n" @@ -822,50 +582,14 @@ }, { "cell_type": "code", - "execution_count": 15, - "id": "8363e1ea", + "execution_count": null, + "id": "c44d5877", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+-------------------+\n", - "|vendor_id| pickup_at| dropoff_at|\n", - "+---------+-------------------+-------------------+\n", - "| VTS|2011-02-27 04:00:00|2011-02-27 04:14:00|\n", - "| VTS|2011-02-27 20:38:00|2011-02-27 20:46:00|\n", - "| VTS|2011-02-27 17:47:00|2011-02-27 17:58:00|\n", - "| VTS|2011-02-26 19:56:00|2011-02-26 20:04:00|\n", - "| VTS|2011-02-23 13:05:00|2011-02-23 13:10:00|\n", - "| VTS|2011-02-27 03:48:00|2011-02-27 04:01:00|\n", - "| VTS|2011-02-27 17:52:00|2011-02-27 18:02:00|\n", - "| VTS|2011-02-27 00:44:00|2011-02-27 01:04:00|\n", - "| VTS|2011-02-27 04:08:00|2011-02-27 04:22:00|\n", - "| VTS|2011-02-27 11:53:00|2011-02-27 12:05:00|\n", - "+---------+-------------------+-------------------+\n", - "only showing top 10 rows" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "feature_group.select().show()" @@ -873,50 +597,14 @@ }, { "cell_type": "code", - "execution_count": 16, - "id": "a992899c", + "execution_count": null, + "id": "feb2593e", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', 
description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+-------------------+\n", - "|vendor_id| pickup_at|\n", - "+---------+-------------------+\n", - "| VTS|2011-02-27 04:00:00|\n", - "| VTS|2011-02-27 20:38:00|\n", - "| VTS|2011-02-27 17:47:00|\n", - "| VTS|2011-02-26 19:56:00|\n", - "| VTS|2011-02-23 13:05:00|\n", - "| VTS|2011-02-27 03:48:00|\n", - "| VTS|2011-02-27 17:52:00|\n", - "| VTS|2011-02-27 00:44:00|\n", - "| VTS|2011-02-27 04:08:00|\n", - "| VTS|2011-02-27 11:53:00|\n", - "+---------+-------------------+\n", - "only showing top 10 rows" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "feature_group.select([\"vendor_id\", \"pickup_at\"]).show()" @@ -924,39 +612,14 @@ }, { "cell_type": "code", - "execution_count": 17, - "id": "aaf454e6", + "execution_count": null, + "id": "f7fb9882", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "+---------+---------+----------+\n", - "|vendor_id|pickup_at|dropoff_at|\n", - "+---------+---------+----------+\n", - "+---------+---------+----------+" - ] - } - ], + "outputs": [], "source": [ "%%spark\n", "feature_group.filter(feature_group.vendor_id == \"CMT\").show()" @@ -964,16 +627,16 @@ }, { "cell_type": "markdown", - "id": "5bd5b8e6", + "id": "180bf37b", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ - "\n", + "\n", "# References\n", - "\n", + "- [Feature Store 
Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n", "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", "- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)\n", @@ -983,7 +646,7 @@ { "cell_type": "code", "execution_count": null, - "id": "dd4d0acc", + "id": "009f9008", "metadata": { "pycharm": { "name": "#%%\n" @@ -995,9 +658,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python [conda env:fspyspark32_p38_cpu#conda_v1]", + "display_name": "Python [conda env:fspyspark32_p38_cpu_v1]", "language": "python", - "name": "conda-env-fspyspark32_p38_cpu_conda_v1-py" + "name": "conda-env-fspyspark32_p38_cpu_v1-py" }, "language_info": { "codemirror_mode": { diff --git a/notebook_examples/index.json b/notebook_examples/index.json index 2280fa0b..00678550 100644 --- a/notebook_examples/index.json +++ b/notebook_examples/index.json @@ -581,5 +581,74 @@ "summary": "Compare training time between CPU and GPU trained models using XGBoost", "time_created": "2023-03-30T10:01:38", "title": "XGBoost with RAPIDS" + }, + { + "developed_on": "fspyspark32_p38_cpu_v1", + "filename": "feature_store_quickstart.ipynb", + "keywords": [ + "pyspark", + "featurestore", + "machine learning", + "feature transformation", + "feature storage", + "feature validation", + "feature statistics" + ], + "license": "Universal Permissive License v 1.0", + "size": 21304, + "summary": "Introduction to the Oracle Cloud Infrastructure Feature Store.Use feature store for feature ingestion and feature querying", + "time_created": "2023-03-29T11:04:51", + "title": "Feature Store Quickstart" + }, + { + "developed_on": "fspyspark32_p38_cpu_v1", + "filename": "feature_store_querying.ipynb", + "keywords": [ + "pyspark", + "featurestore", + 
"feature querying", + "feature transformation", + "feature storage", + "feature validation", + "feature statistics" + ], + "license": "Universal Permissive License v 1.0", + "size": 21304, + "summary": "Explore Feature Store Functionalities.Transform, Store your Data in Feature Store.Query your data using Feature Store using pandas like interface to query and join", + "time_created": "2023-03-29T11:04:51", + "title": "Feature store handling querying operations" + }, + { + "developed_on": "fspyspark32_p38_cpu_v1", + "filename": "feature_store_schema_evolution.ipynb", + "keywords": [ + "pyspark", + "featurestore", + "feature transformation", + "feature storage", + "schema evolution" + ], + "license": "Universal Permissive License v 1.0", + "size": 21304, + "summary": "Perform Schema Enforcement and Schema Evolution in Feature Store when materialising the data", + "time_created": "2023-03-29T11:04:51", + "title": "Schema Enforcement and Schema Evolution in Feature Store" + }, + { + "developed_on": "fspyspark32_p38_cpu_v1", + "filename": "feature_store_spark_magic.ipynb", + "keywords": [ + "pyspark", + "featurestore", + "feature transformation", + "feature storage", + "schema evolution", + "data flow" + ], + "license": "Universal Permissive License v 1.0", + "size": 21304, + "summary": "Run Feature Store on interactive Spark workloads on a long lasting Data Flow Cluster", + "time_created": "2023-03-29T11:04:51", + "title": "Data Flow Studio : Big Data Operations in Feature Store" } ] \ No newline at end of file From 5375e636aeb801df82d1d1ca76a9b7eb34c307b3 Mon Sep 17 00:00:00 2001 From: najiyacl Date: Thu, 12 Oct 2023 19:39:35 +0530 Subject: [PATCH 3/3] Correction,Incorporating the review comments --- .../feature_store_querying.ipynb | 238 +++++++++--------- .../feature_store_quickstart.ipynb | 190 +++++++------- .../feature_store_schema_evolution.ipynb | 197 ++++++++------- .../feature_store_spark_magic.ipynb | 181 ++++++------- 4 files changed, 417 insertions(+), 389 
deletions(-) diff --git a/notebook_examples/feature_store_querying.ipynb b/notebook_examples/feature_store_querying.ipynb index 771aa10e..776f9a52 100644 --- a/notebook_examples/feature_store_querying.ipynb +++ b/notebook_examples/feature_store_querying.ipynb @@ -2,16 +2,12 @@ "cells": [ { "cell_type": "raw", - "id": "a5f5a0ea", - "metadata": { - "pycharm": { - "name": "#%% raw\n" - } - }, + "id": "5ff263c5", + "metadata": {}, "source": [ "qweews@notebook{feature_store-querying.ipynb,\n", - " title: Using feature store for feature querying using pandas like interface for query and join,\n", - " summary: Feature store quickstart guide to perform feature querying using pandas like interface for query and join.,\n", + " title: Feature store handling querying operations\n", + " summary: Using feature store to transform, store and query your data using pandas like interface to query and join\n", " developed_on: fspyspark32_p38_cpu_v1,\n", " keywords: feature store, querying,\n", " license: Universal Permissive License v 1.0\n", @@ -21,7 +17,17 @@ { "cell_type": "code", "execution_count": null, - "id": "983875a7", + "id": "03f57fba", + "metadata": {}, + "outputs": [], + "source": [ + "!odsc conda install -s fspyspark32_p38_cpu_v1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb694927", "metadata": { "ExecuteTime": { "end_time": "2023-05-24T08:26:08.572567Z", @@ -39,7 +45,7 @@ }, { "cell_type": "markdown", - "id": "3beb360a", + "id": "08c20e45", "metadata": { "pycharm": { "name": "#%% md\n" @@ -48,7 +54,7 @@ "source": [ "Oracle Data Science service sample notebook.\n", "\n", - "Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", + "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. 
Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", "\n", "***\n", "\n", @@ -58,15 +64,15 @@ "---\n", "# Overview:\n", "---\n", - "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps defines a standardised pipeline for ingestion of data and querying of data.\n", + "Managing many datasets, data sources, and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training serving skew all lead to increased model development time and poor model performance. Feature store can be used to solve many of the problems because it provides a centralised way to transform and access data for training and serving time. Feature store helps define a standardised pipeline for ingestion of data and querying of data. This notebook demonstrates how to use feature store using a notebook spark session.\n", "\n", "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n", "\n", "## Contents:\n", "\n", "- 1. Introduction\n", - "- 1. Pre-requisites to Running this Notebook\n", - " - 2.1 Setup\n", + "- 2. Pre-requisites to Running this Notebook\n", + " - 2.1. Setup\n", " - 2.2 Policies\n", " - 2.3 Authentication\n", " - 2.4 Variables\n", @@ -76,10 +82,10 @@ " - 3.3. Explore feature groups\n", " - 3.4. Select subset of features\n", " - 3.5. Filter feature groups\n", - " - 3.6. Apply joins on feature group\n", + " - 3.6. Apply joins on feature groups\n", " - 3.7. 
Create dataset from multiple or one feature group\n", - " - 3.8 Free form sql query\n", - " - 3.9 Feature store Entities using YAML\n", + " - 3.8. Free form sql query\n", + " - 3.9. Feature store Entities using YAML\n", "- 4. References\n", "\n", "---\n", @@ -93,7 +99,7 @@ }, { "cell_type": "markdown", - "id": "56dc5982", + "id": "9c794835", "metadata": { "pycharm": { "name": "#%% md\n" @@ -103,29 +109,29 @@ "\n", "# 1. Introduction\n", "\n", - "Oracle feature store is a stack based solution that is deployed in the customer enclave using OCI resource manager. Customer can stand up the service with infrastructure in their own tenancy. The service consists of API which are deployed in customer tenancy using resource manager.\n", + "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n", "\n", - "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "Review the following key terms to understand the Data Science feature store:\n", "\n", "\n", - "* **Feature Vector**: Set of feature values for any one primary/identifier key. Eg. All/subset of features of customer id ‘2536’ can be called as one feature vector.\n", + "* **Feature Vector**: Set of feature values for any one primary or identifier key. For example, all or a subset of features of customer id ‘2536’ can be called as one feature vector.\n", "\n", "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", "\n", - "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Another way to look at it is that an entity is an object or concept that is described by its features. 
Examples of entities could be customer, product, transaction, review, image, document, etc.\n", + "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Or an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n", "\n", - "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ml models. It serves as an organizational unit within the feature store for users to manage, version and share features across different ml projects. By organizing features into groups, data scientists and ml engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n", "\n", - "* **Feature Group Job**: Feature group job is the execution instance of a feature group. Each feature group job will include validation results and statistics results.\n", + "* **Feature Group Job**: A feature group job is the processing instance of a feature group. 
Each feature group job includes validation results and statistics results.\n", "\n", - "* **Dataset**: A dataset is a collection of feature that are used together to either train a model or perform model inference.\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", "\n", - "* **Dataset Job**: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results." + "* **Dataset Job**: A dataset job is the processing instance of a dataset. Each dataset job includes validation results and statistics results." ] }, { "cell_type": "markdown", - "id": "6faf8c9a", + "id": "d9cadd7f", "metadata": { "pycharm": { "name": "#%% md\n" @@ -139,12 +145,12 @@ "\n", "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", "\n", - "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. \n" + "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session.\n" ] }, { "cell_type": "markdown", - "id": "5de2b05e", + "id": "45568cac", "metadata": { "pycharm": { "name": "#%% md\n" @@ -181,7 +187,7 @@ }, { "cell_type": "markdown", - "id": "79215ead", + "id": "b83d4381", "metadata": { "pycharm": { "name": "#%% md\n" @@ -200,7 +206,7 @@ }, { "cell_type": "markdown", - "id": "b8ba35e1", + "id": "2c8084e8", "metadata": { "pycharm": { "name": "#%% md\n" @@ -209,14 +215,14 @@ "source": [ "\n", "### 2.3. Authentication\n", - "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook cluster.
\n", + "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook session.
\n", "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```." ] }, { "cell_type": "code", "execution_count": null, - "id": "ec734e55", + "id": "419755bd", "metadata": { "ExecuteTime": { "start_time": "2023-05-24T08:26:08.577504Z" @@ -235,7 +241,7 @@ }, { "cell_type": "markdown", - "id": "68ed4943", + "id": "869ab9aa", "metadata": { "pycharm": { "name": "#%% md\n" @@ -244,13 +250,13 @@ "source": [ "\n", "### 2.4. Variables\n", - "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for offline feature store." + "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and `` which is the OCID of the Data Catalog metastore." ] }, { "cell_type": "code", "execution_count": null, - "id": "e6173268", + "id": "622e7cd2", "metadata": { "pycharm": { "is_executing": true, @@ -267,7 +273,7 @@ }, { "cell_type": "markdown", - "id": "18669545", + "id": "c6b93dd8", "metadata": { "pycharm": { "name": "#%% md\n" @@ -282,7 +288,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7f696caa", + "id": "1c7e9f28", "metadata": { "pycharm": { "is_executing": true, @@ -299,7 +305,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5e31f620", + "id": "a617d39a", "metadata": { "pycharm": { "is_executing": true, @@ -322,7 +328,7 @@ }, { "cell_type": "markdown", - "id": "0a2dd067", + "id": "5473731e", "metadata": { "pycharm": { "name": "#%% md\n" @@ -336,7 +342,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1989eb8d", + "id": "120e216e", "metadata": { "pycharm": { "is_executing": true, @@ -353,7 +359,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d1ddca21", + "id": "6bc0ed52", "metadata": { "pycharm": { "is_executing": true, @@ -369,7 +375,7 @@ { "cell_type": "code", "execution_count": null, - 
"id": "da859a88", + "id": "0ab07769", "metadata": { "pycharm": { "is_executing": true, @@ -384,7 +390,7 @@ }, { "cell_type": "markdown", - "id": "ac4e1264", + "id": "9e339416", "metadata": { "pycharm": { "name": "#%% md\n" @@ -397,7 +403,7 @@ }, { "cell_type": "markdown", - "id": "b4c78551", + "id": "99671727", "metadata": { "pycharm": { "name": "#%% md\n" @@ -414,7 +420,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6686061a", + "id": "3a210b6a", "metadata": { "pycharm": { "is_executing": true, @@ -435,7 +441,7 @@ { "cell_type": "code", "execution_count": null, - "id": "507427bb", + "id": "1757874f", "metadata": { "pycharm": { "is_executing": true, @@ -450,7 +456,7 @@ }, { "cell_type": "markdown", - "id": "06ff51d1", + "id": "c4206e4f", "metadata": { "pycharm": { "name": "#%% md\n" @@ -464,7 +470,7 @@ { "cell_type": "code", "execution_count": null, - "id": "fb1178da", + "id": "3ab9d52e", "metadata": { "pycharm": { "name": "#%%\n" @@ -481,7 +487,7 @@ }, { "cell_type": "markdown", - "id": "8415e7ba", + "id": "7290a601", "metadata": { "pycharm": { "name": "#%% md\n" @@ -494,7 +500,7 @@ }, { "cell_type": "markdown", - "id": "a1de5443", + "id": "f6660b7e", "metadata": { "pycharm": { "name": "#%% md\n" @@ -514,7 +520,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d1e7b81d", + "id": "cbeed679", "metadata": { "pycharm": { "name": "#%%\n" @@ -536,7 +542,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1e1dd87e", + "id": "fa811b52", "metadata": { "collapsed": false, "jupyter": { @@ -554,7 +560,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f41999bb", + "id": "e0da6f0d", "metadata": { "pycharm": { "name": "#%%\n" @@ -568,7 +574,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6f22a65c", + "id": "8da122b1", "metadata": { "pycharm": { "name": "#%%\n" @@ -581,7 +587,7 @@ }, { "cell_type": "markdown", - "id": "174992cd", + "id": "67016236", "metadata": { "pycharm": { "name": "#%% md\n" @@ -597,7 +603,7 @@ { 
"cell_type": "code", "execution_count": null, - "id": "d2ff01e9", + "id": "5fabe3f2", "metadata": { "pycharm": { "name": "#%%\n" @@ -632,7 +638,7 @@ { "cell_type": "code", "execution_count": null, - "id": "0a4e00ee", + "id": "99294b1c", "metadata": { "pycharm": { "name": "#%%\n" @@ -658,7 +664,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f16d798b", + "id": "91301811", "metadata": { "collapsed": false, "jupyter": { @@ -676,7 +682,7 @@ { "cell_type": "code", "execution_count": null, - "id": "eab02fe6", + "id": "767ce780", "metadata": { "pycharm": { "name": "#%%\n" @@ -690,7 +696,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c404fd39", + "id": "1fce8a51", "metadata": { "pycharm": { "name": "#%%\n" @@ -703,7 +709,7 @@ }, { "cell_type": "markdown", - "id": "9d44607e", + "id": "15ac3504", "metadata": { "pycharm": { "name": "#%% md\n" @@ -722,7 +728,7 @@ }, { "cell_type": "markdown", - "id": "6800691b", + "id": "192ad5cd", "metadata": { "pycharm": { "name": "#%% md\n" @@ -735,7 +741,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b493fedc", + "id": "a95c8833", "metadata": { "pycharm": { "name": "#%%\n" @@ -757,7 +763,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b065942d", + "id": "0976a401", "metadata": { "pycharm": { "name": "#%%\n" @@ -783,7 +789,7 @@ { "cell_type": "code", "execution_count": null, - "id": "fea7a0fa", + "id": "b9d7ef56", "metadata": { "collapsed": false, "jupyter": { @@ -801,7 +807,7 @@ { "cell_type": "code", "execution_count": null, - "id": "00c8f7bc", + "id": "b926675d", "metadata": { "pycharm": { "name": "#%%\n" @@ -815,7 +821,7 @@ { "cell_type": "code", "execution_count": null, - "id": "45b463d9", + "id": "707c723d", "metadata": { "pycharm": { "name": "#%%\n" @@ -828,7 +834,7 @@ }, { "cell_type": "markdown", - "id": "e33b817c", + "id": "683f2b6f", "metadata": { "pycharm": { "name": "#%% md\n" @@ -842,7 +848,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8228ed24", + "id": 
"d06d4dbc", "metadata": { "pycharm": { "name": "#%%\n" @@ -856,7 +862,7 @@ { "cell_type": "code", "execution_count": null, - "id": "fcf3b866", + "id": "5fb0ed95", "metadata": { "pycharm": { "name": "#%%\n" @@ -870,7 +876,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a730f3f1", + "id": "edd5c063", "metadata": { "pycharm": { "name": "#%%\n" @@ -883,7 +889,7 @@ }, { "cell_type": "markdown", - "id": "a89ddd8f", + "id": "647a3818", "metadata": {}, "source": [ "You can retrieve feature data in a DataFrame, that can either be used to train models." @@ -892,7 +898,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a5ccdb47", + "id": "3484d5af", "metadata": { "pycharm": { "name": "#%%\n" @@ -906,7 +912,7 @@ { "cell_type": "code", "execution_count": null, - "id": "9e4151b1", + "id": "53111598", "metadata": { "pycharm": { "name": "#%%\n" @@ -920,7 +926,7 @@ { "cell_type": "code", "execution_count": null, - "id": "18dc0c4f", + "id": "1947ac2b", "metadata": { "pycharm": { "name": "#%%\n" @@ -933,7 +939,7 @@ }, { "cell_type": "markdown", - "id": "5dfab426", + "id": "5ff4ffc6", "metadata": {}, "source": [ "You can call the `get_statistics()` method of the feature group to fetch statistics for a specific ingestion job.You can use `to_pandas()` or `to_json()` to view the statistics.\n", @@ -943,7 +949,7 @@ { "cell_type": "code", "execution_count": null, - "id": "cffeb756", + "id": "30d69581", "metadata": { "pycharm": { "name": "#%%\n" @@ -957,7 +963,7 @@ { "cell_type": "code", "execution_count": null, - "id": "b1fdd7f6", + "id": "02bec075", "metadata": { "pycharm": { "name": "#%%\n" @@ -971,7 +977,7 @@ { "cell_type": "code", "execution_count": null, - "id": "64cc1014", + "id": "e31148f6", "metadata": {}, "outputs": [], "source": [ @@ -980,7 +986,7 @@ }, { "cell_type": "markdown", - "id": "6cb585d5", + "id": "11f3a879", "metadata": {}, "source": [ "You can call the `get_validation_output()` method of the FeatureGroup instance to fetch validation results 
for a specific ingestion job." @@ -989,7 +995,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a13fc434", + "id": "d382ff25", "metadata": { "pycharm": { "name": "#%%\n" @@ -1003,7 +1009,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c219e3f9", + "id": "442d4462", "metadata": {}, "outputs": [], "source": [ @@ -1012,7 +1018,7 @@ }, { "cell_type": "markdown", - "id": "e301ded3", + "id": "e3ded350", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1026,7 +1032,7 @@ { "cell_type": "code", "execution_count": null, - "id": "66194d26", + "id": "e11b7184", "metadata": { "pycharm": { "name": "#%%\n" @@ -1039,7 +1045,7 @@ }, { "cell_type": "markdown", - "id": "dd80ceb0", + "id": "16561536", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1053,7 +1059,7 @@ { "cell_type": "code", "execution_count": null, - "id": "aa4cc044", + "id": "1a251d97", "metadata": { "pycharm": { "name": "#%%\n" @@ -1066,7 +1072,7 @@ }, { "cell_type": "markdown", - "id": "f885a179", + "id": "2944b0e7", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1074,14 +1080,14 @@ }, "source": [ "\n", - "### 3.6. Apply joins on feature group\n", + "### 3.6. Apply joins on feature groups\n", "As in Pandas, if the feature has the same name on both feature groups, then you can use the `on=[]` paramter. If they have different names, then you can use the `left_on=[]` and `right_on=[]` paramters:" ] }, { "cell_type": "code", "execution_count": null, - "id": "526997d3", + "id": "56bedaff", "metadata": { "pycharm": { "name": "#%%\n" @@ -1102,7 +1108,7 @@ { "cell_type": "code", "execution_count": null, - "id": "22dbfa74", + "id": "5d77bcde", "metadata": { "pycharm": { "name": "#%%\n" @@ -1115,7 +1121,7 @@ }, { "cell_type": "markdown", - "id": "9b903c93", + "id": "b018652b", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1123,7 +1129,7 @@ }, "source": [ "\n", - "### 3.7 Dataset\n", + "### 3.7. 
Create dataset from multiple or one feature group\n", "A dataset is a collection of feature snapshots that are joined together to either train a model or perform model inference.\n", "\n", "
\n", @@ -1134,7 +1140,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5f060a15", + "id": "1df69889", "metadata": { "pycharm": { "name": "#%%\n" @@ -1155,7 +1161,7 @@ }, { "cell_type": "markdown", - "id": "77d3f2a9", + "id": "5bb3d9ff", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1171,7 +1177,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3d95cf4c", + "id": "b5dd4e45", "metadata": { "pycharm": { "name": "#%%\n" @@ -1185,7 +1191,7 @@ { "cell_type": "code", "execution_count": null, - "id": "aaf8c3b4", + "id": "fc64c019", "metadata": { "pycharm": { "name": "#%%\n" @@ -1198,7 +1204,7 @@ }, { "cell_type": "markdown", - "id": "d1ea299d", + "id": "6fabd82e", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1211,7 +1217,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7a3d4e72", + "id": "14b04a7c", "metadata": { "pycharm": { "name": "#%%\n" @@ -1225,7 +1231,7 @@ }, { "cell_type": "markdown", - "id": "40efc4ab", + "id": "2305f112", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1241,7 +1247,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e533a24a", + "id": "bad948c4", "metadata": { "pycharm": { "name": "#%%\n" @@ -1255,7 +1261,7 @@ { "cell_type": "code", "execution_count": null, - "id": "807e340c", + "id": "f3125aa8", "metadata": { "pycharm": { "name": "#%%\n" @@ -1269,7 +1275,7 @@ { "cell_type": "code", "execution_count": null, - "id": "df9155bf", + "id": "840a47fa", "metadata": { "pycharm": { "name": "#%%\n" @@ -1282,7 +1288,7 @@ }, { "cell_type": "markdown", - "id": "db06133b", + "id": "8b3ed236", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1290,14 +1296,14 @@ }, "source": [ "\n", - "### 3.8 Freeform SQL query\n", + "### 3.8. Freeform SQL query\n", "Feature store provides a way to query feature store using free flow query. User need to mention `entity id` as the database name and `feature group name` as the table name to query feature store. 
This functionality can be useful if you need to express more complex queries for your use case" ] }, { "cell_type": "code", "execution_count": null, - "id": "276e8053", + "id": "5d38518f", "metadata": { "pycharm": { "name": "#%%\n" @@ -1316,7 +1322,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d7987003", + "id": "425b900b", "metadata": { "pycharm": { "name": "#%%\n" @@ -1329,18 +1335,18 @@ }, { "cell_type": "markdown", - "id": "10d6f553", + "id": "1962f56e", "metadata": {}, "source": [ "\n", - "### 3.9 Feature store Entities using YAML\n", + "### 3.9. Feature store Entities using YAML\n", "In an ADS feature store module, you can either use the Python programmatic interface or YAML to define feature store entities. Below section describes how to create feature store entities using YAML as an interface." ] }, { "cell_type": "code", "execution_count": null, - "id": "67f69307", + "id": "2734866a", "metadata": { "pycharm": { "name": "#%%\n" @@ -1446,7 +1452,7 @@ { "cell_type": "code", "execution_count": null, - "id": "db2eb17e", + "id": "b988a15a", "metadata": { "pycharm": { "name": "#%%\n" @@ -1460,7 +1466,7 @@ }, { "cell_type": "markdown", - "id": "93fbdbfe", + "id": "9fee36b0", "metadata": { "pycharm": { "name": "#%% md\n" @@ -1468,7 +1474,7 @@ }, "source": [ "\n", - "# References\n", + "# 4. 
References\n", "\n", "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n", "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", @@ -1478,15 +1484,13 @@ ] }, { - "cell_type": "code", - "execution_count": null, - "id": "4f95ea9b", + "cell_type": "markdown", + "id": "a9c7006c", "metadata": { "pycharm": { "name": "#%%\n" } }, - "outputs": [], "source": [] } ], diff --git a/notebook_examples/feature_store_quickstart.ipynb b/notebook_examples/feature_store_quickstart.ipynb index e795ccb8..26db2c76 100644 --- a/notebook_examples/feature_store_quickstart.ipynb +++ b/notebook_examples/feature_store_quickstart.ipynb @@ -2,12 +2,12 @@ "cells": [ { "cell_type": "raw", - "id": "5563bdd3", + "id": "63f5fcad", "metadata": {}, "source": [ "@notebook{feature_store-quickstart.ipynb,\n", " title: Using feature store for feature ingestion and feature querying,\n", - " summary: Introduction to the Oracle Cloud Infrastructure Feature Store.Use feature store for feature ingestion and feature querying.,\n", + " summary: Introduction to the Oracle Cloud Infrastructure Feature Store.Use feature store for feature ingestion and feature querying,\n", " developed_on: fspyspark32_p38_cpu_v1,\n", " keywords: feature store,\n", " license: Universal Permissive License v 1.0\n", @@ -17,7 +17,17 @@ { "cell_type": "code", "execution_count": null, - "id": "35bdd0d7", + "id": "e4664bc7", + "metadata": {}, + "outputs": [], + "source": [ + "!odsc conda install -s fspyspark32_p38_cpu_v1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "60881fb2", "metadata": { "pycharm": { "is_executing": true @@ -31,12 +41,12 @@ }, { "cell_type": "markdown", - "id": "725a5e59", + "id": "526f6c48", "metadata": {}, "source": [ "Oracle Data Science service sample notebook.\n", "\n", - "Copyright (c) 2022 Oracle, Inc. All rights reserved. 
Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", + "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", "\n", "***\n", "\n", @@ -46,14 +56,14 @@ "---\n", "# Overview:\n", "---\n", - "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps defines a standardised pipeline for ingestion of data and querying of data.\n", + "Managing many datasets, datasources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all lead to increased model development time and worse model performance. Feature store can be used to solve many of the problems becuase it provides a centralised way to transform and access data for training and serving time. Feature store helps define a standardised pipeline for ingestion of data and querying of data.This notebook demonstrates how to use feature store using a notebook spark session.\n", "\n", "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n", "\n", "## Contents:\n", "\n", "- 1. Introduction\n", - "- 2. Pre-requisites\n", + "- 2. Pre-requisites to Running this Notebook\n", " - 2.1 Setup\n", " - 2.2 Policies\n", " - 2.3 Authentication\n", @@ -61,12 +71,12 @@ "- 3. 
Feature store quickstart using APIs\n", " - 3.1 Exploration of data\n", " - 3.2 Create feature store logical entities\n", - " - 3.2.1 Create feature store\n", - " - 3.2.2 Create business entity in feature store\n", - " - 3.2.3 Create transformation in feature store\n", - " - 3.2.4 Create feature group and upload data to feature group\n", - " - 3.3 Explore feature group\n", - " - 3.4 Create dataset from multiple or one feature group\n", + " - 3.2.1 Feature store\n", + " - 3.2.2 Entity\n", + " - 3.2.3 Transformation\n", + " - 3.2.4 Feature group \n", + " - 3.3 Explore feature groups\n", + " - 3.4 Create dataset\n", " - 3.3 Explore dataset\n", " - 4. Feature store quickstart using YAML\n", " - 5. References\n", @@ -88,50 +98,48 @@ }, { "cell_type": "markdown", - "id": "90024d60", + "id": "ce2026c1", "metadata": {}, "source": [ "\n", "# 1. Introduction\n", "\n", - "Oracle feature store is a stack based solution that is deployed in the customer enclave using OCI resource manager. Customer can stand up the service with infrastructure in their own tenancy. The service consists of API which are deployed in customer tenancy using resource manager.\n", + "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n", "\n", - "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "Review the following key terms to understand the Data Science feature store:\n", "\n", "\n", - "* **Feature Vector**: Set of feature values for any one primary/identifier key. Eg. All/subset of features of customer id ‘2536’ can be called as one feature vector.\n", + "* **Feature Vector**: Set of feature values for any one primary or identifier key. 
For example, all or a subset of features of customer id ‘2536’ can be called as one feature vector.\n", "\n", "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", "\n", - "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc.\n", + "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Or an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n", "\n", - "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ml models. It serves as an organizational unit within the feature store for users to manage, version and share features across different ml projects. By organizing features into groups, data scientists and ml engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version and share features across different ML projects. 
By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n", "\n", - "* **Feature Group Job**: Feature group job is the execution instance of a feature group. Each feature group job will include validation results and statistics results.\n", + "* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation results and statistics results.\n", "\n", - "* **Dataset**: A dataset is a collection of feature that are used together to either train a model or perform model inference.\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", "\n", - "* **Dataset Job**: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results." + "* **Dataset Job**: dataset job is the processing instance of a dataset. Each dataset job includes validation results and statistics results." ] }, { "cell_type": "markdown", - "id": "9fb00256", + "id": "b4c99a09", "metadata": {}, "source": [ "\n", - "# 2. Pre-requisites to Running this Notebook \n", - "\n", - "Notebook Sessions are accessible through the following conda environment: \n", + "# 2. Pre-requisites to Running this Notebook\n", "\n", - "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", + "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1) conda environment.\n", "\n", "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. 
" ] }, { "cell_type": "markdown", - "id": "83904ad6", + "id": "55a6b373", "metadata": {}, "source": [ "\n", @@ -166,7 +174,7 @@ }, { "cell_type": "markdown", - "id": "6bdca361", + "id": "31411ccd", "metadata": {}, "source": [ "\n", @@ -179,19 +187,19 @@ }, { "cell_type": "markdown", - "id": "cf094492", + "id": "a9c2f3f8", "metadata": {}, "source": [ "\n", "### 2.3. Authentication\n", - "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook Spark cluster.
\n", + "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook session.
\n", "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. " ] }, { "cell_type": "code", "execution_count": null, - "id": "9f35e1a0", + "id": "dae3ada6", "metadata": {}, "outputs": [], "source": [ @@ -201,7 +209,7 @@ }, { "cell_type": "markdown", - "id": "17b184d7", + "id": "e05054be", "metadata": {}, "source": [ "\n", @@ -212,7 +220,7 @@ { "cell_type": "code", "execution_count": null, - "id": "9b7f9ecc", + "id": "42eb13d1", "metadata": {}, "outputs": [], "source": [ @@ -224,7 +232,7 @@ }, { "cell_type": "markdown", - "id": "931d2532", + "id": "a322c822", "metadata": {}, "source": [ "\n", @@ -235,7 +243,7 @@ { "cell_type": "code", "execution_count": null, - "id": "9c8018d4", + "id": "8ff205bc", "metadata": {}, "outputs": [], "source": [ @@ -251,7 +259,7 @@ }, { "cell_type": "markdown", - "id": "4e007b50", + "id": "f30f4edd", "metadata": {}, "source": [ "\n", @@ -261,7 +269,7 @@ { "cell_type": "code", "execution_count": null, - "id": "5882786a", + "id": "43343910", "metadata": {}, "outputs": [], "source": [ @@ -271,7 +279,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a5c9b752", + "id": "71ff424a", "metadata": {}, "outputs": [], "source": [ @@ -281,7 +289,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7e09f121", + "id": "0ade4f83", "metadata": {}, "outputs": [], "source": [ @@ -290,7 +298,7 @@ }, { "cell_type": "markdown", - "id": "58a3b034", + "id": "1af3d7cc", "metadata": {}, "source": [ "\n", @@ -299,11 +307,11 @@ }, { "cell_type": "markdown", - "id": "0faeae33", + "id": "7397f58c", "metadata": {}, "source": [ "\n", - "#### 3.2.1 Feature Store\n", + "#### 3.2.1 Feature store\n", "\n", "Feature store is the top level entity for feature store service.\n", "Call the ```.create()``` method of the Feature store instance to create a feature store." 
@@ -312,7 +320,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3548e0a6", + "id": "655e04c2", "metadata": {}, "outputs": [], "source": [ @@ -328,7 +336,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e79d9727", + "id": "fbc492ff", "metadata": {}, "outputs": [], "source": [ @@ -337,18 +345,18 @@ }, { "cell_type": "markdown", - "id": "aca2d27c", + "id": "48a349dc", "metadata": {}, "source": [ "\n", "#### 3.2.2 Entity\n", - "An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc." + "An entity is a group of semantically related features. " ] }, { "cell_type": "code", "execution_count": null, - "id": "28e9762c", + "id": "51ed55b0", "metadata": {}, "outputs": [], "source": [ @@ -360,7 +368,7 @@ }, { "cell_type": "markdown", - "id": "6a1ec785", + "id": "7635cfca", "metadata": {}, "source": [ "\n", @@ -371,7 +379,7 @@ { "cell_type": "code", "execution_count": null, - "id": "dc898997", + "id": "c507286d", "metadata": {}, "outputs": [], "source": [ @@ -383,7 +391,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d624680b", + "id": "c181827c", "metadata": {}, "outputs": [], "source": [ @@ -397,7 +405,7 @@ }, { "cell_type": "markdown", - "id": "550abcbb", + "id": "5c917e7c", "metadata": {}, "source": [ "\n", @@ -407,12 +415,12 @@ }, { "cell_type": "markdown", - "id": "22c00f3a", + "id": "5345aa39", "metadata": {}, "source": [ "\n", "##### 3.2.4.1 Associate Expectation Suite\n", - "Feature validation is the process of checking the quality and accuracy of the features used in a machine learning model.Feature store allows you to define expectation on the data which is being materialized into 
feature group and dataset.This is achieved using open source library Great Expectations.\n", + "Feature validation is the process of checking the quality and accuracy of the features used in a machine learning model.Feature store allows you to define expectation on the data which is being materialised into feature group and dataset.This is achieved using open source library Great Expectations.\n", "\n", "An Expectation is a verifiable assertion about your data. You can define expectation as below:" ] @@ -420,7 +428,7 @@ { "cell_type": "code", "execution_count": null, - "id": "3d5a352f", + "id": "babc39c3", "metadata": {}, "outputs": [], "source": [ @@ -436,7 +444,7 @@ { "cell_type": "code", "execution_count": null, - "id": "2f9cc4e8", + "id": "bfcf8653", "metadata": {}, "outputs": [], "source": [ @@ -456,7 +464,7 @@ { "cell_type": "code", "execution_count": null, - "id": "67b8b4ef", + "id": "3fe51b5e", "metadata": {}, "outputs": [], "source": [ @@ -465,7 +473,7 @@ }, { "cell_type": "markdown", - "id": "fbe3f5bf", + "id": "28f95654", "metadata": {}, "source": [ "\n", @@ -475,7 +483,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f63e15f1", + "id": "a68d2c02", "metadata": {}, "outputs": [], "source": [ @@ -484,7 +492,7 @@ }, { "cell_type": "markdown", - "id": "d9ac48a1", + "id": "a723bc8f", "metadata": {}, "source": [ "\n", @@ -493,7 +501,7 @@ }, { "cell_type": "markdown", - "id": "0377adfa", + "id": "f012acae", "metadata": {}, "source": [ "You can retrieve feature data in a DataFrame, that can either be used directly to train models or materialized to file(s) for later use to train models" @@ -502,7 +510,7 @@ { "cell_type": "code", "execution_count": null, - "id": "54116cfa", + "id": "e6e9516e", "metadata": {}, "outputs": [], "source": [ @@ -512,7 +520,7 @@ }, { "cell_type": "markdown", - "id": "9f022e11", + "id": "23b9704c", "metadata": {}, "source": [ "You can call the `get_statistics()` method of the feature group to fetch statistics for a specific 
ingestion job.You can use `to_pandas()` or `to_json()` to view the statistics." @@ -521,7 +529,7 @@ { "cell_type": "code", "execution_count": null, - "id": "00b66cbe", + "id": "0be0b698", "metadata": {}, "outputs": [], "source": [ @@ -530,7 +538,7 @@ }, { "cell_type": "markdown", - "id": "8adf24e2", + "id": "086e9f8a", "metadata": {}, "source": [ "You can visualize feature statistics with `to_viz()`" @@ -539,7 +547,7 @@ { "cell_type": "code", "execution_count": null, - "id": "09afd99d", + "id": "4e3c9a53", "metadata": {}, "outputs": [], "source": [ @@ -549,7 +557,7 @@ { "cell_type": "code", "execution_count": null, - "id": "1a9a05fa", + "id": "63f9d642", "metadata": {}, "outputs": [], "source": [ @@ -558,7 +566,7 @@ }, { "cell_type": "markdown", - "id": "088f602c", + "id": "36ed80f5", "metadata": {}, "source": [ "You can call the `get_validation_output()` method of the FeatureGroup instance to fetch validation results for a specific ingestion job." @@ -567,7 +575,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8dd4687f", + "id": "b6b9b759", "metadata": {}, "outputs": [], "source": [ @@ -577,7 +585,7 @@ { "cell_type": "code", "execution_count": null, - "id": "ce9db608", + "id": "83fd0852", "metadata": {}, "outputs": [], "source": [ @@ -586,7 +594,7 @@ }, { "cell_type": "markdown", - "id": "e468f448", + "id": "1f3eb0dd", "metadata": {}, "source": [ "\n", @@ -598,7 +606,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e147e248", + "id": "ca36cc7b", "metadata": {}, "outputs": [], "source": [ @@ -607,7 +615,7 @@ }, { "cell_type": "markdown", - "id": "e635249e", + "id": "9151b303", "metadata": {}, "source": [ "\n", @@ -618,7 +626,7 @@ { "cell_type": "code", "execution_count": null, - "id": "bc169f01", + "id": "bbf4fd15", "metadata": {}, "outputs": [], "source": [ @@ -628,7 +636,7 @@ { "cell_type": "code", "execution_count": null, - "id": "52f9a271", + "id": "9957c20e", "metadata": {}, "outputs": [], "source": [ @@ -646,7 +654,7 @@ { 
"cell_type": "code", "execution_count": null, - "id": "d8661c89", + "id": "677a8061", "metadata": {}, "outputs": [], "source": [ @@ -655,7 +663,7 @@ }, { "cell_type": "markdown", - "id": "baaf2112", + "id": "1b9fca33", "metadata": {}, "source": [ "You can call the `materialise()` method of the Dataset instance to load the data to dataset." @@ -664,7 +672,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7228ed61", + "id": "0d7a6a34", "metadata": {}, "outputs": [], "source": [ @@ -673,7 +681,7 @@ }, { "cell_type": "markdown", - "id": "b1b09af2", + "id": "5c77773c", "metadata": {}, "source": [ "\n", @@ -683,7 +691,7 @@ { "cell_type": "code", "execution_count": null, - "id": "028c72dc", + "id": "c1832b28", "metadata": {}, "outputs": [], "source": [ @@ -693,7 +701,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d5e2c54d", + "id": "de6c4045", "metadata": {}, "outputs": [], "source": [ @@ -703,7 +711,7 @@ { "cell_type": "code", "execution_count": null, - "id": "dd6e28d2", + "id": "2b6da96a", "metadata": {}, "outputs": [], "source": [ @@ -713,7 +721,7 @@ { "cell_type": "code", "execution_count": null, - "id": "4fd4ed61", + "id": "b0f4dcc2", "metadata": {}, "outputs": [], "source": [ @@ -722,7 +730,7 @@ }, { "cell_type": "markdown", - "id": "76558b69", + "id": "e6419d55", "metadata": {}, "source": [ "\n", @@ -734,7 +742,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8031042b", + "id": "ee24e1d8", "metadata": {}, "outputs": [], "source": [ @@ -743,18 +751,18 @@ }, { "cell_type": "markdown", - "id": "e9aab9aa", + "id": "9b9b5cce", "metadata": {}, "source": [ "\n", - "# 4. Feature store quick start using YAML\n", + "# 4. Feature store quickstart using YAML\n", "In an ADS feature store module, you can either use the Python programmatic interface or YAML to define feature store entities. Below section describes how to create feature store entities using YAML as an interface." 
] }, { "cell_type": "code", "execution_count": null, - "id": "1cf18dd5", + "id": "b7479c28", "metadata": {}, "outputs": [], "source": [ @@ -806,7 +814,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e4c774bf", + "id": "ebdbb40e", "metadata": {}, "outputs": [], "source": [ @@ -816,7 +824,7 @@ }, { "cell_type": "markdown", - "id": "57a43397", + "id": "3bc2818c", "metadata": {}, "source": [ "\n", @@ -832,7 +840,7 @@ { "cell_type": "code", "execution_count": null, - "id": "bb23af05", + "id": "ff4d2ad3", "metadata": {}, "outputs": [], "source": [] diff --git a/notebook_examples/feature_store_schema_evolution.ipynb b/notebook_examples/feature_store_schema_evolution.ipynb index a430b97f..6fd1c115 100644 --- a/notebook_examples/feature_store_schema_evolution.ipynb +++ b/notebook_examples/feature_store_schema_evolution.ipynb @@ -2,14 +2,14 @@ "cells": [ { "cell_type": "raw", - "id": "6e72604a", + "id": "12ce2509", "metadata": {}, "source": [ - "qweews@notebook{feature_store-querying.ipynb,\n", - " title: Schema Enforcement and Schema Evolution in Feature Store ,\n", + "qweews@notebook{feature_store_schema_evolution.ipynb,\n", + " title: Schema Enforcement and Schema Evolution in Feature Store,\n", " summary: Perform Schema Enforcement and Schema Evolution in Feature Store when materialising the data.,\n", " developed_on: fspyspark32_p38_cpu_v1,\n", - " keywords: feature store, querying,\n", + " keywords: feature store, querying ,schema enforcement,schema evolution\n", " license: Universal Permissive License v 1.0\n", "}" ] @@ -17,7 +17,17 @@ { "cell_type": "code", "execution_count": null, - "id": "997bb810", + "id": "59b6b678", + "metadata": {}, + "outputs": [], + "source": [ + "!odsc conda install -s fspyspark32_p38_cpu_v1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77341f7e", "metadata": { "ExecuteTime": { "end_time": "2023-05-24T08:26:08.572567Z", @@ -32,12 +42,12 @@ }, { "cell_type": "markdown", - "id": "3dd0bbd5", + "id": 
"eafaf892", "metadata": {}, "source": [ "Oracle Data Science service sample notebook.\n", "\n", - "Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", + "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", "\n", "***\n", "\n", @@ -47,7 +57,7 @@ "---\n", "# Overview:\n", "---\n", - "Managing many datasets, data-sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift and training serving skew all leads to increased model development time and worse model performance. Here, feature store is well positioned to solve many of the problems since it provides a centralised way to transform and access data for training and serving time and helps defines a standardised pipeline for ingestion of data and querying of data. This notebook shows how schema enforcement and schema evolution are carried out in Feature Store\n", + "Managing many datasets, data sources and transformations for machine learning is complex and costly. Poorly cleaned data, data issues, bugs in transformations, data drift, and training serving skew all lead to increased model development time and poor model performance. Feature store can be used to solve many of the problems becuase it provides a centralised way to transform and access data for training and serving time. Feature store helps define a standardised pipeline for ingestion of data and querying of data. This notebook shows how schema enforcement and schema evolution are carried out in Feature Store\n", "\n", "Compatible conda pack: [PySpark 3.2 and Feature store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8\n", "\n", @@ -58,21 +68,22 @@ "## Contents:\n", "\n", "- 1. Introduction\n", - "- 2. 
Pre-requisites\n", - " - 2.1 Setup\n", - " - 2.2 Policies\n", - " - 2.3 Authentication\n", - " - 2.4 Variables\n", + "- 2. Pre-requisites to Running this Notebook\n", + " - 2.1. Setup\n", + " - 2.2. Policies\n", + " - 2.3. Authentication\n", + " - 2.4. Variables\n", "- 3. Schema enforcement and schema evolution\n", " - 3.1. Exploration of data in feature store\n", " - 3.2. Create feature store logical entities\n", " - 3.3. Schema enforcement\n", - " - 3.4. Ingestion Modes\n", - " - 3.4.1 Append\n", - " - 3.4.2 Overwrite\n", - " - 3.4.3 Upsert\n", - " - 3.5. History\n", - " - 3.6. As_of Feature \n", + " - 3.4. Schema evolution\n", + " - 3.5. Ingestion Modes\n", + " - 3.5.1. Append\n", + " - 3.5.2. Overwrite\n", + " - 3.5.3. Upsert\n", + " - 3.6. Viewing Feature Group History\n", + " - 3.7. Time travel Queries on Feature Group \n", "- 4. References\n", "\n", "---\n", @@ -86,49 +97,47 @@ }, { "cell_type": "markdown", - "id": "cc61a6ad", + "id": "2df44476", "metadata": {}, "source": [ "\n", "# 1. Introduction\n", "\n", - "Oracle feature store is a stack based solution that is deployed in the customer enclave using OCI resource manager. Customer can stand up the service with infrastructure in their own tenancy. The service consists of API which are deployed in customer tenancy using resource manager.\n", + "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n", "\n", - "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "Review the following key terms to understand the Data Science feature store:\n", "\n", "\n", - "* **Feature Vector**: Set of feature values for any one primary/identifier key. Eg. All/subset of features of customer id ‘2536’ can be called as one feature vector.\n", + "* **Feature Vector**: Set of feature values for any one primary or identifier key. 
For example, all or a subset of the features of customer id ‘2536’ form one feature vector.\n", "\n", "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", "\n", - "* **Entity**: An entity is a group of semantically related features. The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc.\n", + "* **Entity**: An entity is a group of semantically related features. Typically, the first step a consumer of features takes when accessing the feature store service is to list the entities and their associated features. Put another way, an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n", "\n", - "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ml models. It serves as an organizational unit within the feature store for users to manage, version and share features across different ml projects. By organizing features into groups, data scientists and ml engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version, and share features across different ML projects. 
By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering.\n", "\n", - "* **Feature Group Job**: Feature group job is the execution instance of a feature group. Each feature group job will include validation results and statistics results.\n", + "* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation and statistics results.\n", "\n", - "* **Dataset**: A dataset is a collection of feature that are used together to either train a model or perform model inference.\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", "\n", - "* **Dataset Job**: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results." + "* **Dataset Job**: A dataset job is the processing instance of a dataset. Each dataset job includes validation and statistics results." ] }, { "cell_type": "markdown", - "id": "10ada53a", + "id": "c76e31af", "metadata": {}, "source": [ "\n", "# 2. Pre-requisites to Running this Notebook\n", - "Notebook Sessions are accessible through the following conda environment: \n", - "\n", - "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", + "Notebook Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1) conda environment.\n", "\n", - "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session cluster. 
\n" + "You can customize `fspyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Notebook session.\n" ] }, { "cell_type": "markdown", - "id": "e519c49e", + "id": "1233c93e", "metadata": {}, "source": [ "\n", @@ -161,7 +170,7 @@ }, { "cell_type": "markdown", - "id": "e840f262", + "id": "24965d4e", "metadata": {}, "source": [ "\n", @@ -176,19 +185,19 @@ }, { "cell_type": "markdown", - "id": "eeec1d4d", + "id": "c74885f6", "metadata": {}, "source": [ "\n", "### 2.3. Authentication\n", - "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook cluster.
\n", + "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the notebook session.
\n", "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```." ] }, { "cell_type": "code", "execution_count": null, - "id": "233ac5e8", + "id": "24963829", "metadata": { "ExecuteTime": { "start_time": "2023-05-24T08:26:08.577504Z" @@ -206,18 +215,18 @@ }, { "cell_type": "markdown", - "id": "429c36d6", + "id": "b0c5e0fd", "metadata": {}, "source": [ "\n", "### 2.4. Variables\n", - "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and bucket `` for offline feature store." + "To run this notebook, you must provide some information about your tenancy configuration. To create and run a feature store, you must specify a `` and `` for offline feature store." ] }, { "cell_type": "code", "execution_count": null, - "id": "80e80a24", + "id": "edaf733c", "metadata": { "pycharm": { "is_executing": true @@ -233,18 +242,18 @@ }, { "cell_type": "markdown", - "id": "e9f96e28", + "id": "c9c2e7c8", "metadata": {}, "source": [ "\n", "# 3. Schema enforcement and schema evolution\n", - "By default the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html).Schema enforcement is a Delta Lake feature that prevents you from appending data with a different schema to a table.To change a table's current schema and to accommodate data that is changing over time,Schema evolution feature is used while performing an append or overwrite operation." 
+ "By default the **PySpark 3.2, Feature store and Data Flow** conda environment includes pre-installed [great-expectations](https://legacy.docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html). Schema enforcement is a Delta Lake feature that prevents you from appending data with a different schema to a table. To change a table's current schema and accommodate data that changes over time, use schema evolution when performing an append or overwrite operation." ] }, { "cell_type": "code", "execution_count": null, - "id": "b1169e3a", + "id": "75d9beed", "metadata": { "pycharm": { "is_executing": true @@ -266,7 +275,7 @@ }, { "cell_type": "markdown", - "id": "a9d0cad0", + "id": "7ff53923", "metadata": {}, "source": [ "\n", @@ -280,7 +289,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8b59c7e4", + "id": "f43e2ef0", "metadata": {}, "outputs": [], "source": [ @@ -292,7 +301,7 @@ { "cell_type": "code", "execution_count": null, - "id": "6735e954", + "id": "82430a0d", "metadata": { "pycharm": { "is_executing": true @@ -308,7 +317,7 @@ { "cell_type": "code", "execution_count": null, - "id": "363c818b", + "id": "143c3b29", "metadata": { "pycharm": { "is_executing": true @@ -322,7 +331,7 @@ }, { "cell_type": "markdown", - "id": "4c800a75", + "id": "00083134", "metadata": {}, "source": [ "\n", @@ -331,17 +340,17 @@ }, { "cell_type": "markdown", - "id": "ab64f16f", + "id": "4f99ae87", "metadata": {}, "source": [ - "#### 3.2.1 Feature Store\n", + "#### 3.2.1. 
Feature Store\n", "Feature store is the top level entity for feature store service" ] }, { "cell_type": "code", "execution_count": null, - "id": "01c4dc79", + "id": "ca0b8bfd", "metadata": { "pycharm": { "is_executing": true @@ -360,7 +369,7 @@ }, { "cell_type": "markdown", - "id": "d6c3e1bf", + "id": "9704fa85", "metadata": {}, "source": [ "\n", @@ -372,7 +381,7 @@ { "cell_type": "code", "execution_count": null, - "id": "35d70317", + "id": "de4b205d", "metadata": { "pycharm": { "is_executing": true @@ -386,17 +395,17 @@ }, { "cell_type": "markdown", - "id": "de92fc24", + "id": "473f6677", "metadata": {}, "source": [ - "#### 3.2.2 Entity\n", + "#### 3.2.2. Entity\n", "An entity is a group of semantically related features." ] }, { "cell_type": "code", "execution_count": null, - "id": "39087c3a", + "id": "3dcf22bf", "metadata": {}, "outputs": [], "source": [ @@ -409,11 +418,11 @@ }, { "cell_type": "markdown", - "id": "33485b3e", + "id": "80c9c3be", "metadata": {}, "source": [ "\n", - "#### 3.2.3 Feature Group\n", + "#### 3.2.3. Feature Group\n", "\n", "Create feature group for airport" ] @@ -421,7 +430,7 @@ { "cell_type": "code", "execution_count": null, - "id": "13ff8e8c", + "id": "970161e6", "metadata": {}, "outputs": [], "source": [ @@ -454,7 +463,7 @@ { "cell_type": "code", "execution_count": null, - "id": "66fae082", + "id": "bc323dd5", "metadata": {}, "outputs": [], "source": [ @@ -476,7 +485,7 @@ { "cell_type": "code", "execution_count": null, - "id": "d966fc78", + "id": "2e437522", "metadata": { "collapsed": false, "jupyter": { @@ -491,7 +500,7 @@ { "cell_type": "code", "execution_count": null, - "id": "e4bbefa2", + "id": "3fc98501", "metadata": {}, "outputs": [], "source": [ @@ -501,7 +510,7 @@ { "cell_type": "code", "execution_count": null, - "id": "9f9519e6", + "id": "ec22e95c", "metadata": {}, "outputs": [], "source": [ @@ -510,19 +519,19 @@ }, { "cell_type": "markdown", - "id": "dff776cc", + "id": "ed7c012e", "metadata": {}, "source": [ "\n", "### 3.3. 
Schema enforcement\n", "\n", - "Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Like the front desk manager at a busy restaurant that only accepts reservations, it checks to see whether each column in data inserted into the table is on its list of expected columns (in other words, whether each one has a \"reservation\"), and rejects any writes with columns that aren't on the list." + "Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table's schema. Like a front desk manager at a busy restaurant that only accepts reservations, schema enforcement checks whether each column in the data inserted into the table is in the list of expected columns (that is, whether each one has a \"reservation\"), and rejects any writes with columns that aren't on the list." ] }, { "cell_type": "code", "execution_count": null, - "id": "1791d8f0", + "id": "eef566cc", "metadata": {}, "outputs": [], "source": [ @@ -534,7 +543,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c6357225", + "id": "d620cedf", "metadata": {}, "outputs": [], "source": [ @@ -545,7 +554,7 @@ { "cell_type": "code", "execution_count": null, - "id": "7dc15e14", + "id": "c62e82c9", "metadata": {}, "outputs": [], "source": [ @@ -554,19 +563,19 @@ }, { "cell_type": "markdown", - "id": "107c8b58", + "id": "e8cbab63", "metadata": {}, "source": [ "\n", "### 3.4. Schema evolution\n", "\n", - "Schema evolution is a feature that allows users to easily change a table's current schema to accommodate data that is changing over time. Most commonly, it's used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns." 
+ "Schema evolution allows you to change a table's current schema to accommodate data that is changing over time. Typically, it's used when performing an append or overwrite operation to automatically adapt the schema to include one or more new columns." ] }, { "cell_type": "code", "execution_count": null, - "id": "aeba1145", + "id": "d69f3378", "metadata": {}, "outputs": [], "source": [ @@ -577,7 +586,7 @@ { "cell_type": "code", "execution_count": null, - "id": "42e74b33", + "id": "794597c9", "metadata": {}, "outputs": [], "source": [ @@ -590,7 +599,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f4e11f65", + "id": "bcaa552c", "metadata": {}, "outputs": [], "source": [ @@ -599,7 +608,7 @@ }, { "cell_type": "markdown", - "id": "a30c68c3", + "id": "b4eca757", "metadata": {}, "source": [ "\n", @@ -608,7 +617,7 @@ }, { "cell_type": "markdown", - "id": "43eb6897", + "id": "7ae4c0e5", "metadata": {}, "source": [ "\n", @@ -620,7 +629,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c6587f2e", + "id": "24acf8e6", "metadata": {}, "outputs": [], "source": [ @@ -630,18 +639,18 @@ }, { "cell_type": "markdown", - "id": "363557f5", + "id": "91ed98b3", "metadata": {}, "source": [ "\n", "#### 3.5.2. Overwrite\n", - "In ``overwrite`` mode, the existing table is replaced entirely with the new data being saved. If the table already exists, it will be dropped and a new table will be created with the new data. This mode is useful when you want to completely refresh the data in the table with the latest data, discarding any previous records." + "In ``overwrite`` mode, the existing table is replaced entirely with the new data being saved. If the table already exists, it is dropped and a new table is created with the new data. This mode is useful when you want to completely refresh the data in the table with the latest data and discard all previous records." 
] }, { "cell_type": "code", "execution_count": null, - "id": "a869935e", + "id": "690a6136", "metadata": {}, "outputs": [], "source": [ @@ -651,18 +660,18 @@ }, { "cell_type": "markdown", - "id": "320681ba", + "id": "1866795c", "metadata": {}, "source": [ "\n", "#### 3.5.3. Upsert\n", - "``Upsert`` mode, also known as ``merge`` mode, is used to update existing records in the table based on a primary key or a specified condition. If a record with the same key exists, it will be updated with the new data; otherwise, a new record will be inserted. This mode is useful for maintaining and synchronizing data between the source and destination tables while avoiding duplicates." + "``Upsert`` mode (merge mode) is used to update existing records in the table based on a primary key or a specified condition. If a record with the same key exists, it is updated with the new data. Otherwise, a new record is inserted. This mode is useful for maintaining and synchronizing data between the source and destination tables while avoiding duplicates." ] }, { "cell_type": "code", "execution_count": null, - "id": "39016aea", + "id": "b2ddd858", "metadata": {}, "outputs": [], "source": [ @@ -672,18 +681,18 @@ }, { "cell_type": "markdown", - "id": "61d6d851", + "id": "5495e3cf", "metadata": {}, "source": [ "\n", - "### 3.6. History\n", + "### 3.6. Viewing Feature Group History\n", "You can call the ``history()`` method of the FeatureGroup instance to show history of the feature group." ] }, { "cell_type": "code", "execution_count": null, - "id": "ecfb6075", + "id": "feb1762f", "metadata": {}, "outputs": [], "source": [ @@ -692,13 +701,13 @@ }, { "cell_type": "markdown", - "id": "eb8e49ff", + "id": "1fb57d2c", "metadata": {}, "source": [ "\n", - "### 3.7. as_of\n", + "### 3.7. 
Time travel Queries on Feature Group\n", "\n", - "You can call the ``as_of()`` method of the FeatureGroup instance to to get specified point in time and time traveled data.\n", + "You can call the ``as_of()`` method of the FeatureGroup instance to retrieve the feature group data as it existed at a specified point in time (time travel).\n", "The ``.as_of()`` method takes the following optional parameter:\n", "\n", "- commit_timestamp: date-time. Commit timestamp for feature group\n", @@ -708,7 +717,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a559b6ed", + "id": "1ec4cc00", "metadata": {}, "outputs": [], "source": [ @@ -718,7 +727,7 @@ { "cell_type": "code", "execution_count": null, - "id": "8c701548", + "id": "4ce013ab", "metadata": {}, "outputs": [], "source": [ @@ -727,11 +736,11 @@ }, { "cell_type": "markdown", - "id": "0abb3f0e", + "id": "1abcc338", "metadata": {}, "source": [ "\n", - "# References\n", + "# 4. References\n", "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n", "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", @@ -742,7 +751,7 @@ { "cell_type": "code", "execution_count": null, - "id": "965b015a", + "id": "1157840c", "metadata": {}, "outputs": [], "source": [] diff --git a/notebook_examples/feature_store_spark_magic.ipynb b/notebook_examples/feature_store_spark_magic.ipynb index 748c657d..98d2cb40 100644 --- a/notebook_examples/feature_store_spark_magic.ipynb +++ b/notebook_examples/feature_store_spark_magic.ipynb @@ -2,21 +2,42 @@ "cells": [ { "cell_type": "raw", - "id": "5c01b54f", + "id": "8ce4f16c", "metadata": {}, "source": [ - "qweews@notebook{feature_store-querying.ipynb,\n", - " title: Data Flow Studio : Big Data Operations in Feature Store,\n", + "qweews@notebook{feature_store_spark_magic.ipynb,\n", + " title: Data Flow Studio 
: Big Data Operations in Feature Store.,\n", " summary: Run Feature Store on interactive Spark workloads on a long lasting Data Flow Cluster.,\n", " developed_on: fspyspark32_p38_cpu_v1,\n", - " keywords: feature store, querying,\n", + " keywords: feature store, querying,spark magic,data flow\n", " license: Universal Permissive License v 1.0\n", "}" ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "55da3909", + "metadata": {}, + "outputs": [], + "source": [ + "!odsc conda install -s fspyspark32_p38_cpu_v1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c24e9f2", + "metadata": {}, + "outputs": [], + "source": [ + "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", + "!pip install --pre --no-deps oracle-ads==2.9.0rc0" + ] + }, { "cell_type": "markdown", - "id": "8df0ae4c", + "id": "fe5598bd", "metadata": { "pycharm": { "name": "#%% md\n" @@ -25,8 +46,7 @@ "source": [ "Oracle Data Science service sample notebook.\n", "\n", - "Copyright (c) 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", - "\n", + "Copyright (c) 2022, 2023 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).\n", "***\n", "\n", "# Data Flow Studio: Big Data Operations in Feature Store\n", @@ -35,28 +55,28 @@ "---\n", "# Overview:\n", "\n", - "This notebook demonstrates how to run Feature Store on interactive Spark workloads on a long lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** is used for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. 
It includes a set of magic commands for interactively running Spark code.\n", + "This notebook demonstrates how to run Feature Store on interactive Spark workloads on a long lasting [Oracle Cloud Infrastructure Data Flow](https://docs.oracle.com/en-us/iaas/data-flow/using/home.htm) cluster through [Apache Livy](https://livy.apache.org/) integration. **Data Flow Spark Magic** is used for interactively working with remote Spark clusters using Livy (a Spark REST server) in Jupyter notebooks. Data Flow Spark Magic includes a set of magic commands for interactively running Spark code.\n", "\n", "\n", "\n", "## Contents:\n", "\n", "- 1. Introduction\n", - "- 2. Pre-requisites\n", + "- 2. 2. Pre-requisites to Running this Notebook\n", " - 2.1 Policies\n", " - 2.2 Helpers\n", " - 2.3 Authentication\n", " - 2.4 Variables\n", - "- 3. Dataflow Magic\n", - " - 3.1. Load extension\n", + "- 3. Data Flow Spark Magic\n", + " - 3.1. Load Spark Magic Commands and Getting Help\n", " - 3.2. Create DataFlow Session\n", " - 3.3. Data exploration\n", - " - 3.4. Creation of logical entities of feature group\n", - " - 3.4.1 Creation of feature store\n", - " - 3.4.2 Creation of entity\n", - " - 3.4.3 Creation of feature group\n", - " - 3.4.4 Materialisation of feature group\n", - " - 3.4.5 Querying of feature group\n", + " - 3.4. Create Feature Store Logical Entities\n", + " - 3.4.1 Creating a feature store\n", + " - 3.4.2 Creating an entity\n", + " - 3.4.3 Creating a feature group\n", + " - 3.4.4 Materialising a Feature Group\n", + " - 3.4.5 Querying a Feature group\n", "- 4. 
References\n", "\n", "---\n", @@ -65,20 +85,9 @@ "Compatible conda pack: [PySpark 3.2 and Feature Store](https://docs.oracle.com/iaas/data-science/using/conda-pyspark-fam.htm) for CPU on Python 3.8 (version 1.0)\n" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "e8463c70", - "metadata": {}, - "outputs": [], - "source": [ - "# Upgrade Oracle ADS to pick up the latest preview version to maintain compatibility with Oracle Cloud Infrastructure.\n", - "!pip install --pre --no-deps oracle-ads==2.9.0rc0" - ] - }, { "cell_type": "markdown", - "id": "496a7e98", + "id": "cd84d936", "metadata": { "pycharm": { "name": "#%% md\n" @@ -88,29 +97,29 @@ "\n", "# 1. Introduction\n", "\n", - "Oracle feature store is a stack based solution that is deployed in the customer enclave using OCI resource manager. Customer can stand up the service with infrastructure in their own tenancy. The service consists of API which are deployed in customer tenancy using resource manager.\n", + "OCI Data Science feature store is a stack-based API solution that's deployed using OCI Resource Manager in your tenancy.\n", "\n", - "The following are some key terms that will help you understand OCI Data Science Feature Store:\n", + "Review the following key terms to understand the Data Science feature store:\n", "\n", "\n", - "* **Feature Vector**: Set of feature values for any one primary/identifier key. Eg. All/subset of features of customer id ‘2536’ can be called as one feature vector.\n", + "* **Feature Vector**: Set of feature values for any one primary or identifier key. For example, all or a subset of features of customer id ‘2536’ can be called as one feature vector.\n", "\n", "* **Feature**: A feature is an individual measurable property or characteristic of a phenomenon being observed.\n", "\n", - "* **Entity**: An entity is a group of semantically related features. 
The first step a consumer of features would typically do when accessing the feature store service is to list the entities and the entities associated features. Another way to look at it is that an entity is an object or concept that is described by its features. Examples of entities could be customer, product, transaction, review, image, document, etc.\n", + "* **Entity**: An entity is a group of semantically related features. Typically, the first step a consumer of features takes when accessing the feature store service is to list the entities and their associated features. Put another way, an entity is an object or concept that is described by its features. Examples of entities are customer, product, transaction, review, image, document, and so on.\n", "\n", - "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in ml models. It serves as an organizational unit within the feature store for users to manage, version and share features across different ml projects. By organizing features into groups, data scientists and ml engineers can efficiently discover, reuse and collaborate on features reducing the redundant work and ensuring consistency in feature engineering.\n", + "* **Feature Group**: A feature group in a feature store is a collection of related features that are often used together in machine learning (ML) models. It serves as an organizational unit within the feature store for you to manage, version, and share features across different ML projects. By organizing features into groups, data scientists and ML engineers can efficiently discover, reuse, and collaborate on features, reducing redundant work and ensuring consistency in feature engineering.\n", "\n", - "* **Feature Group Job**: Feature group job is the execution instance of a feature group. 
Each feature group job will include validation results and statistics results.\n", + "* **Feature Group Job**: A feature group job is the processing instance of a feature group. Each feature group job includes validation and statistics results.\n", "\n", - "* **Dataset**: A dataset is a collection of feature that are used together to either train a model or perform model inference.\n", + "* **Dataset**: A dataset is a collection of features that are used together to either train a model or perform model inference.\n", "\n", - "* **Dataset Job**: Dataset job is the execution instance of a dataset. Each dataset job will include validation results and statistics results." + "* **Dataset Job**: A dataset job is the processing instance of a dataset. Each dataset job includes validation and statistics results." ] }, { "cell_type": "markdown", - "id": "22bcec86", + "id": "76acf33b", "metadata": { "pycharm": { "name": "#%% md\n" @@ -120,16 +129,14 @@ "\n", "# 2. Pre-requisites to Running this Notebook\n", "\n", - "Data Flow Sessions are accessible through the following conda environment: \n", - "\n", - "* **PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1)**\n", + "Data Flow Sessions are accessible using the PySpark 3.2 and Feature Store Python 3.8 (fspyspark32_p38_cpu_v1) conda environment.\n", "\n", "The [Data Catalog Hive Metastore](https://docs.oracle.com/en-us/iaas/data-catalog/using/metastore.htm) provides schema definitions for objects in structured and unstructured data assets. The Metastore is the central metadata repository to understand tables backed by files on object storage. You can customize `fs_pyspark32_p38_cpu_v1`, publish it, and use it as a runtime environment for a Data Flow session cluster. The metastore id of hive metastore is tied to feature store construct of feature store service." 
] }, { "cell_type": "markdown", - "id": "28fd82db", + "id": "f1e2b6a1", "metadata": { "pycharm": { "name": "#%% md\n" @@ -148,7 +155,7 @@ }, { "cell_type": "markdown", - "id": "d92fd7df", + "id": "2207bfb3", "metadata": { "pycharm": { "name": "#%% md\n" @@ -157,13 +164,13 @@ "source": [ "\n", "## 2.2 Helpers\n", - "This section provides a helper method used across the notebook to prepare arguments for the magic commands. This function is particularly useful when you want to pass Python variables as arguments to the spark magic commands " + "This helper method is used across the notebook to prepare arguments for the magic commands. This function is particularly useful when you want to pass Python variables as arguments to the spark magic commands." ] }, { "cell_type": "code", "execution_count": null, - "id": "3c535a71", + "id": "32894857", "metadata": { "pycharm": { "name": "#%%\n" @@ -181,7 +188,7 @@ }, { "cell_type": "markdown", - "id": "4d699131", + "id": "4b610391", "metadata": { "pycharm": { "name": "#%% md\n" @@ -191,13 +198,13 @@ "\n", "## 2.3. Authentication\n", "The [Oracle Accelerated Data Science SDK (ADS)](https://docs.oracle.com/iaas/tools/ads-sdk/latest/index.html) controls the authentication mechanism with the Data Flow Session Spark cluster.
\n", - "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. " + "To setup authentication use the ```ads.set_auth(\"resource_principal\")``` or ```ads.set_auth(\"api_key\")```. For example:" ] }, { "cell_type": "code", "execution_count": null, - "id": "0ed15b93", + "id": "ae1080f3", "metadata": { "pycharm": { "name": "#%%\n" @@ -212,7 +219,7 @@ }, { "cell_type": "markdown", - "id": "44d0c3f1", + "id": "3aa57661", "metadata": { "pycharm": { "name": "#%% md\n" @@ -229,7 +236,7 @@ { "cell_type": "code", "execution_count": null, - "id": "c3ab9476", + "id": "7157113d", "metadata": { "pycharm": { "name": "#%%\n" @@ -247,7 +254,7 @@ }, { "cell_type": "markdown", - "id": "835fa366", + "id": "426be51d", "metadata": { "pycharm": { "name": "#%% md\n" @@ -256,18 +263,18 @@ "source": [ "\n", "# 3. Data Flow Spark Magic\n", - "Data Flow Spark Magic commands allow you to interactively work with Data Flow Spark clusters (sessions) in Jupyter notebooks through the Livy REST API. It provides a set of Jupyter Notebook cell magic commands to turn Jupyter into an integrated Spark development environment for remote clusters. \n", + "Data Flow Spark Magic commands allow you to interactively work with Data Flow Spark clusters (sessions) in Jupyter notebooks using the Livy REST API. The commands provide a set of Jupyter notebook cell magic commands to turn Jupyter into an integrated Spark development environment for remote clusters. \n", "\n", - "**Data Flow Magic allows you to:**\n", + "**Data Flow Spark Magic allows you to:**\n", "\n", - "* Run Spark code against Data Flow remote Spark cluster\n", - "* Create a Data Flow Spark Session with SparkContext and HiveContext against Data Flow remote Spark cluster\n", - "* Capture the output of Spark queries as a local Pandas data frame to interact easily with other Python libraries (e.g. 
matplotlib)" + "* Run Spark code against a Data Flow remote Spark cluster.\n", + "* Create a Data Flow Spark session with SparkContext and HiveContext against a Data Flow remote Spark cluster.\n", + "* Capture the output of Spark queries as a local Pandas dataframe to interact with other Python libraries (such as matplotlib)." ] }, { "cell_type": "markdown", - "id": "1b977b2b", + "id": "a1f403f7", "metadata": { "pycharm": { "name": "#%% md\n" @@ -277,13 +284,13 @@ "\n", "### 3.1. Load Spark Magic Commands and Getting Help\n", "Data Flow Spark Magic is a JupyterLab extension that you need to activate in your notebook using the `%load_ext dataflow.magics` magic command.
\n", - "After the extension is activated, the `%help` command can be used to get the list of supported commands." + "After the extension is activated, you can use the `%help` command to view the list of supported commands." ] }, { "cell_type": "code", "execution_count": null, - "id": "da895c49", + "id": "4d61b5fa", "metadata": { "pycharm": { "name": "#%%\n" @@ -296,7 +303,7 @@ }, { "cell_type": "markdown", - "id": "f48aa78c", + "id": "ec076494", "metadata": { "pycharm": { "name": "#%% md\n" @@ -304,14 +311,14 @@ }, "source": [ "\n", - "### 3.2. Create Session\n", - "To create a new Data Flow cluster session use the `%create_session` magic command." + "### 3.2. Create a Data Flow Session\n", + "Create a new Data Flow cluster session using the `%create_session` magic command." ] }, { "cell_type": "code", "execution_count": null, - "id": "79775a26", + "id": "1aba2243", "metadata": { "pycharm": { "name": "#%%\n" @@ -346,7 +353,7 @@ { "cell_type": "code", "execution_count": null, - "id": "a00cb706", + "id": "0aad36e2", "metadata": { "pycharm": { "name": "#%%\n" @@ -377,7 +384,7 @@ }, { "cell_type": "markdown", - "id": "f43808d1", + "id": "9e9fafb3", "metadata": { "pycharm": { "name": "#%% md\n" @@ -391,7 +398,7 @@ { "cell_type": "code", "execution_count": null, - "id": "90465a30", + "id": "0dde877c", "metadata": { "pycharm": { "name": "#%%\n" @@ -408,7 +415,7 @@ }, { "cell_type": "markdown", - "id": "47927e0f", + "id": "6180b15c", "metadata": { "pycharm": { "name": "#%% md\n" @@ -421,7 +428,7 @@ }, { "cell_type": "markdown", - "id": "8a5cd5b9", + "id": "c54f9744", "metadata": { "pycharm": { "name": "#%% md\n" @@ -429,14 +436,14 @@ }, "source": [ "\n", - "#### 3.4.1 Creation of Feature Store\n", + "#### 3.4.1.
Creating a Feature Store\n", "Feature store is the top level entity for feature store service" ] }, { "cell_type": "code", "execution_count": null, - "id": "26067de7", + "id": "73b8d3a4", "metadata": { "pycharm": { "name": "#%%\n" @@ -457,7 +464,7 @@ }, { "cell_type": "markdown", - "id": "67bafdd2", + "id": "0569a4f9", "metadata": { "pycharm": { "name": "#%% md\n" @@ -465,14 +472,14 @@ }, "source": [ "\n", - "#### 3.4.2 Creation of Entity\n", + "#### 3.4.2. Creating an Entity\n", "An entity is a group of semantically related features." ] }, { "cell_type": "code", "execution_count": null, - "id": "657d3fe4", + "id": "b85d5002", "metadata": { "pycharm": { "name": "#%%\n" @@ -487,7 +494,7 @@ }, { "cell_type": "markdown", - "id": "47440cde", + "id": "0029a900", "metadata": { "pycharm": { "name": "#%% md\n" @@ -495,14 +502,14 @@ }, "source": [ "\n", - "#### 3.4.3 Creation of Feature group\n", + "#### 3.4.3. Creating a Feature Group\n", "A feature group is an object that represents a logical group of time-series feature data as it is found in a datasource." ] }, { "cell_type": "code", "execution_count": null, - "id": "690aa9f7", + "id": "eab895fe", "metadata": { "pycharm": { "name": "#%%\n" @@ -537,7 +544,7 @@ }, { "cell_type": "markdown", - "id": "55482958", + "id": "147916a9", "metadata": { "pycharm": { "name": "#%% md\n" @@ -545,13 +552,13 @@ }, "source": [ "\n", - "#### 3.4.4 Materialisation of Feature group" + "#### 3.4.4. Materialising a Feature Group" ] }, { "cell_type": "code", "execution_count": null, - "id": "2896b65c", + "id": "1f26caf3", "metadata": { "pycharm": { "name": "#%%\n" @@ -569,7 +576,7 @@ }, { "cell_type": "markdown", - "id": "d7dee26d", + "id": "f39a2317", "metadata": { "pycharm": { "name": "#%% md\n" @@ -577,13 +584,13 @@ }, "source": [ "\n", - "#### 3.4.5 Feature group Querying" + "#### 3.4.5.
Querying a Feature Group" ] }, { "cell_type": "code", "execution_count": null, - "id": "c44d5877", + "id": "ede99da4", "metadata": { "pycharm": { "name": "#%%\n" @@ -598,7 +605,7 @@ { "cell_type": "code", "execution_count": null, - "id": "feb2593e", + "id": "a32b4b1e", "metadata": { "pycharm": { "name": "#%%\n" @@ -613,7 +620,7 @@ { "cell_type": "code", "execution_count": null, - "id": "f7fb9882", + "id": "6de58a22", "metadata": { "pycharm": { "name": "#%%\n" @@ -627,7 +634,7 @@ }, { "cell_type": "markdown", - "id": "180bf37b", + "id": "c3dff2d1", "metadata": { "pycharm": { "name": "#%% md\n" @@ -635,7 +642,7 @@ }, "source": [ "\n", - "# References\n", + "# 4. References\n", "- [Feature Store Documentation](https://feature-store-accelerated-data-science.readthedocs.io/en/latest/overview.html)\n", "- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)\n", "- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)\n", @@ -646,7 +653,7 @@ { "cell_type": "code", "execution_count": null, - "id": "009f9008", + "id": "5f9babbb", "metadata": { "pycharm": { "name": "#%%\n"