diff --git a/docs/content/en/latest/pipelines/ldm_extension/_index.md b/docs/content/en/latest/pipelines/ldm_extension/_index.md
new file mode 100644
index 000000000..7435b59b8
--- /dev/null
+++ b/docs/content/en/latest/pipelines/ldm_extension/_index.md
@@ -0,0 +1,203 @@
+---
+title: "LDM Extension"
+linkTitle: "LDM Extension"
+weight: 3
+no_list: true
+---
+
+Child workspaces inherit the [Logical Data Model](https://www.gooddata.com/docs/cloud/model-data/concepts/logical-data-model/) (LDM) from their parent. You can use GoodData Pipelines to extend a child workspace's LDM with extra datasets specific to the tenant's requirements.
+
+{{% alert color="info" %}} See [Set Up Multiple Tenants](https://www.gooddata.com/docs/cloud/workspaces/) to learn more about leveraging multitenancy in GoodData.{{% /alert %}}
+
+This documentation uses the terms *custom datasets* and *custom fields*. In this context, *custom* refers to extending the LDM beyond the inherited datasets.
+
+## Usage
+
+Start by initializing the `LdmExtensionManager`:
+
+```python
+from gooddata_pipelines import LdmExtensionManager
+
+host = "http://localhost:3000"
+token = "some_user_token"
+
+ldm_extension_manager = LdmExtensionManager.create(host=host, token=token)
+
+```
+
+To extend the LDM, define the custom datasets and the fields they should contain. By default, the manager also checks the validity of analytical objects before and after the update: updates that introduce new invalid relations are automatically rolled back. You can opt out of this behavior by setting the `check_relations` parameter to `False`.
+
+### Custom Dataset Definitions
+
+A custom dataset represents a new dataset appended to the child workspace's LDM. It is defined by the following parameters:
+
+| name | type | description |
+|------|------|-------------|
+| workspace_id | string | ID of the child workspace. |
+| dataset_id | string | ID of the custom dataset. |
+| dataset_name | string | Name of the custom dataset. |
+| dataset_datasource_id | string | ID of the data source. |
+| dataset_source_table | string \| None | Name of the table in the Physical Data Model. |
+| dataset_source_sql | string \| None | SQL query defining the dataset. |
+| parent_dataset_reference | string | ID of the parent dataset to which the custom one will be connected. |
+| parent_dataset_reference_attribute_id | string | ID of the attribute used for creating the relationship in the parent dataset. |
+| dataset_reference_source_column | string | Name of the column used for creating the relationship in the custom dataset. |
+| dataset_reference_source_column_data_type | [ColumnDataType](#columndatatype) | Column data type. |
+| workspace_data_filter_id | string | ID of the workspace data filter to use. |
+| workspace_data_filter_column_name | string | Name of the column in the custom dataset used for filtering. |
+
+#### Validity constraints
+
+Exactly one of `dataset_source_table` and `dataset_source_sql` must be specified with a truthy value. An exception is raised if both parameters are falsy or if both are set.
+
+### Custom Field Definitions
+
+Custom fields define the individual fields in the custom datasets defined above. Each custom field is specified with the following parameters:
+
+| name | type | description |
+|---------------|----------|-----------------|
+| workspace_id | string | ID of the child workspace. |
+| dataset_id | string | ID of the custom dataset. |
+| custom_field_id | string | ID of the custom field. |
+| custom_field_name | string | Name of the custom field. |
+| custom_field_type | [CustomFieldType](#customfieldtype) | Indicates whether the field represents an attribute, a date, or a fact. |
+| custom_field_source_column | string | Name of the column in the physical data model. |
+| custom_field_source_column_data_type | [ColumnDataType](#columndatatype) | Data type of the field. |
+
+#### Validity constraints
+
+The custom field definitions must comply with the following criteria:
+
+- Each attribute and fact must have a unique combination of `workspace_id` and `custom_field_id` values.
+- Each date must have a unique combination of `dataset_id` and `custom_field_id` values.
+
+### Enumerations
+
+Some parameters in the custom field and dataset definitions are specified using the `CustomFieldType` and `ColumnDataType` enums.
+
+#### CustomFieldType
+
+The following field types are supported:
+
+| name | value |
+|------|-------|
+| ATTRIBUTE | "attribute" |
+| FACT | "fact" |
+| DATE | "date" |
+
+#### ColumnDataType
+
+The following data types are supported:
+
+| name | value |
+|------|-------|
+| INT | "INT" |
+| STRING | "STRING" |
+| DATE | "DATE" |
+| NUMERIC | "NUMERIC" |
+| TIMESTAMP | "TIMESTAMP" |
+| TIMESTAMP_TZ | "TIMESTAMP_TZ" |
+| BOOLEAN | "BOOLEAN" |
+
+### Relations Check
+
+Because changes to the LDM may impact existing analytical objects, the manager performs checks to prevent potentially undesirable changes.
+
+{{% alert color="warning" %}} Changes to the LDM can invalidate existing objects. For example, removing a previously added custom field will break any analytical objects using that field. {{% /alert %}}
+
+To prevent this, the manager will:
+
+1. Store the current workspace layout (analytical objects and LDM).
+1. Check whether the relations of metrics, visualizations, and dashboards are valid. A set of current objects with invalid relations is created.
+1. Push the updated LDM to GoodData Cloud.
+1. Check object relations again. A new set of objects with invalid relations is created.
+1. Compare the sets:
+   - If the new set is a subset of the old one, the update is considered successful.
+   - Otherwise, the update is rolled back: the initially stored workspace layout is pushed to GoodData Cloud again, reverting the changes to the workspace.
+
+You can opt out of this check-and-rollback behavior by setting the `check_relations` parameter to `False` when using the `LdmExtensionManager`.
+
+```python
+# By setting `check_relations` to False, you bypass the default checks
+# and rollback mechanism. Note that this may invalidate existing objects.
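+#
+# `custom_dataset_definitions` and `custom_field_definitions` are assumed to be
+# lists of CustomDatasetDefinition and CustomFieldDefinition objects, prepared
+# as shown in the Example section below.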
+ldm_extension_manager.process( + custom_datasets=custom_dataset_definitions, + custom_fields=custom_field_definitions, + check_relations=False, +) + +``` + +## Example + +Here is a complete example of extending a child workspace's LDM: + +```python +from gooddata_pipelines import ( + ColumnDataType, + CustomDatasetDefinition, + CustomFieldDefinition, + CustomFieldType, + LdmExtensionManager, +) + +import logging + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +host = "http://localhost:3000" +token = "some_user_token" + +# Initialize the manager +ldm_extension_manager = LdmExtensionManager.create(host=host, token=token) + +# Optionally, you can subscribe to the logger object to receive log messages +ldm_extension_manager.logger.subscribe(logger) + +# Prepare the definitions +custom_dataset_definitions = [ + CustomDatasetDefinition( + workspace_id="child_workspace_id", + dataset_id="products_custom_dataset_id", + dataset_name="Custom Products Dataset", + dataset_datasource_id="gdc_datasource_id", + dataset_source_table="products_custom", + dataset_source_sql=None, + parent_dataset_reference="products", + parent_dataset_reference_attribute_id="products.product_id", + dataset_reference_source_column="product_id", + dataset_reference_source_column_data_type=ColumnDataType.INT, + workspace_data_filter_id="wdf_id", + workspace_data_filter_column_name="wdf_column", + ) +] + +custom_field_definitions = [ + CustomFieldDefinition( + workspace_id="child_workspace_id", + dataset_id="products_custom_dataset_id", + custom_field_id="is_sold_out", + custom_field_name="Sold Out", + custom_field_type=CustomFieldType.ATTRIBUTE, + custom_field_source_column="is_sold_out", + custom_field_source_column_data_type=ColumnDataType.BOOLEAN, + ), + CustomFieldDefinition( + workspace_id="child_workspace_id", + dataset_id="products_custom_dataset_id", + custom_field_id="category_detail", + custom_field_name="Category (Detail)", + custom_field_type=CustomFieldType.ATTRIBUTE, + custom_field_source_column="category_detail", + custom_field_source_column_data_type=ColumnDataType.STRING, + ), +] + +# Call the process method to extend the LDM +ldm_extension_manager.process( + custom_datasets=custom_dataset_definitions, + custom_fields=custom_field_definitions, +) + +``` diff --git a/gooddata-pipelines/README.md b/gooddata-pipelines/README.md index 6d9d777f5..f79ea2b7b 100644 --- a/gooddata-pipelines/README.md +++ b/gooddata-pipelines/README.md @@ -57,4 +57,12 @@ full_load_data: list[UserFullLoad] = UserFullLoad.from_list_of_dicts( provisioner.full_load(full_load_data) ``` -Ready-made scripts covering the basic use cases can be found here in the [GoodData Productivity Tools](https://github.com/gooddata/gooddata-productivity-tools) repository +## Bugs & Requests + +Please use the [GitHub issue tracker](https://github.com/gooddata/gooddata-python-sdk/issues) to submit bugs +or request features. + +## Changelog + +See [Github releases](https://github.com/gooddata/gooddata-python-sdk/releases) for released versions +and a list of changes. 
diff --git a/gooddata-pipelines/gooddata_pipelines/__init__.py b/gooddata-pipelines/gooddata_pipelines/__init__.py index c7a1bbed8..a2c4792df 100644 --- a/gooddata-pipelines/gooddata_pipelines/__init__.py +++ b/gooddata-pipelines/gooddata_pipelines/__init__.py @@ -13,6 +13,15 @@ from .backup_and_restore.storage.local_storage import LocalStorage from .backup_and_restore.storage.s3_storage import S3Storage +# -------- LDM Extension -------- +from .ldm_extension.ldm_extension_manager import LdmExtensionManager +from .ldm_extension.models.custom_data_object import ( + ColumnDataType, + CustomDatasetDefinition, + CustomFieldDefinition, + CustomFieldType, +) + # -------- Provisioning -------- from .provisioning.entities.user_data_filters.models.udf_models import ( UserDataFilterFullLoad, @@ -65,5 +74,10 @@ "UserDataFilterProvisioner", "UserDataFilterFullLoad", "EntityType", + "LdmExtensionManager", + "CustomDatasetDefinition", + "CustomFieldDefinition", + "ColumnDataType", + "CustomFieldType", "__version__", ] diff --git a/gooddata-pipelines/gooddata_pipelines/api/gooddata_api.py b/gooddata-pipelines/gooddata_pipelines/api/gooddata_api.py index 93ffb5487..38c72476f 100644 --- a/gooddata-pipelines/gooddata_pipelines/api/gooddata_api.py +++ b/gooddata-pipelines/gooddata_pipelines/api/gooddata_api.py @@ -174,6 +174,44 @@ def get_automations(self, workspace_id: str) -> requests.Response: ) return self._get(endpoint) + def get_all_metrics(self, workspace_id: str) -> requests.Response: + """Get all metrics from the specified workspace. + + Args: + workspace_id (str): The ID of the workspace to retrieve metrics from. + Returns: + requests.Response: The response containing the metrics. + """ + endpoint = f"/entities/workspaces/{workspace_id}/metrics" + headers = {**self.headers, "X-GDC-VALIDATE-RELATIONS": "true"} + return self._get(endpoint, headers=headers) + + def get_all_visualization_objects( + self, workspace_id: str + ) -> requests.Response: + """Get all visualizations from the specified workspace. + + Args: + workspace_id (str): The ID of the workspace to retrieve visualizations from. + Returns: + requests.Response: The response containing the visualizations. + """ + endpoint = f"/entities/workspaces/{workspace_id}/visualizationObjects" + headers = {**self.headers, "X-GDC-VALIDATE-RELATIONS": "true"} + return self._get(endpoint, headers=headers) + + def get_all_dashboards(self, workspace_id: str) -> requests.Response: + """Get all dashboards from the specified workspace. + + Args: + workspace_id (str): The ID of the workspace to retrieve dashboards from. + Returns: + requests.Response: The response containing the dashboards. + """ + endpoint = f"/entities/workspaces/{workspace_id}/analyticalDashboards" + headers = {**self.headers, "X-GDC-VALIDATE-RELATIONS": "true"} + return self._get(endpoint, headers=headers) + def _get( self, endpoint: str, headers: dict[str, str] | None = None ) -> requests.Response: @@ -253,3 +291,15 @@ def _delete( url = self._get_url(endpoint) return requests.delete(url, headers=self.headers, timeout=TIMEOUT) + + @staticmethod + def raise_if_response_not_ok(*responses: requests.Response) -> None: + """Check if responses from API calls are OK. + + Raises ValueError if any response is not OK (status code not 2xx). 
+ """ + for response in responses: + if not response.ok: + raise ValueError( + f"Request to {response.url} failed with status code {response.status_code}: {response.text}" + ) diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/__init__.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/__init__.py new file mode 100644 index 000000000..37d863d60 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/__init__.py @@ -0,0 +1 @@ +# (C) 2025 GoodData Corporation diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/input_processor.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/input_processor.py new file mode 100644 index 000000000..64c5611d2 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/input_processor.py @@ -0,0 +1,286 @@ +# (C) 2025 GoodData Corporation +"""Module for processing validated custom datasets and fields data. + +This module is responsible for converting validated custom datasets and fields +into objects defined in the GoodData Python SDK. +""" + +from gooddata_sdk.catalog.identifier import ( + CatalogDatasetWorkspaceDataFilterIdentifier, + CatalogGrainIdentifier, + CatalogReferenceIdentifier, +) +from gooddata_sdk.catalog.workspace.declarative_model.workspace.logical_model.data_filter_references import ( + CatalogDeclarativeWorkspaceDataFilterReferences, +) +from gooddata_sdk.catalog.workspace.declarative_model.workspace.logical_model.dataset.dataset import ( + CatalogDataSourceTableIdentifier, + CatalogDeclarativeAttribute, + CatalogDeclarativeDataset, + CatalogDeclarativeDatasetSql, + CatalogDeclarativeFact, + CatalogDeclarativeReference, + CatalogDeclarativeReferenceSource, + CatalogDeclarativeWorkspaceDataFilterColumn, +) +from gooddata_sdk.catalog.workspace.declarative_model.workspace.logical_model.date_dataset.date_dataset import ( + CatalogDeclarativeDateDataset, + CatalogGranularitiesFormatting, +) +from gooddata_sdk.catalog.workspace.declarative_model.workspace.logical_model.ldm import ( + CatalogDeclarativeLdm, + CatalogDeclarativeModel, +) + +from gooddata_pipelines.ldm_extension.models.aliases import DatasetId +from gooddata_pipelines.ldm_extension.models.custom_data_object import ( + ColumnDataType, + CustomDataset, + CustomFieldDefinition, + CustomFieldType, +) + + +class LdmExtensionDataProcessor: + """Create GoodData LDM from validated custom datasets and fields.""" + + DATE_GRANULARITIES: list[str] = [ + "MINUTE", + "HOUR", + "DAY", + "WEEK", + "MONTH", + "QUARTER", + "YEAR", + "MINUTE_OF_HOUR", + "HOUR_OF_DAY", + "DAY_OF_WEEK", + "DAY_OF_MONTH", + "DAY_OF_YEAR", + "WEEK_OF_YEAR", + "MONTH_OF_YEAR", + "QUARTER_OF_YEAR", + ] + + @staticmethod + def _attribute_from_field( + dataset_name: str, + custom_field: CustomFieldDefinition, + ) -> CatalogDeclarativeAttribute: + """Assign a declarative attribute from a custom field definition.""" + return CatalogDeclarativeAttribute( + id=custom_field.custom_field_id, + title=custom_field.custom_field_name, + source_column=custom_field.custom_field_source_column, + labels=[], + source_column_data_type=custom_field.custom_field_source_column_data_type.value, + tags=[dataset_name], + ) + + @staticmethod + def _fact_from_field( + dataset_name: str, + custom_field: CustomFieldDefinition, + ) -> CatalogDeclarativeFact: + """Assign a declarative fact from a custom field definition.""" + return CatalogDeclarativeFact( + id=custom_field.custom_field_id, + title=custom_field.custom_field_name, + source_column=custom_field.custom_field_source_column, + 
source_column_data_type=custom_field.custom_field_source_column_data_type.value, + tags=[dataset_name], + ) + + def _date_from_field( + self, + dataset_name: str, + custom_field: CustomFieldDefinition, + ) -> CatalogDeclarativeDateDataset: + """Assign a declarative date dataset from a custom field definition.""" + + return CatalogDeclarativeDateDataset( + id=custom_field.custom_field_id, + title=custom_field.custom_field_name, + granularities_formatting=CatalogGranularitiesFormatting( + title_base="", + title_pattern="%titleBase - %granularityTitle", + ), + granularities=self.DATE_GRANULARITIES, + tags=[dataset_name], + ) + + @staticmethod + def _date_ref_from_field( + custom_field: CustomFieldDefinition, + ) -> CatalogDeclarativeReference: + """Create a date reference from a custom field definition.""" + return CatalogDeclarativeReference( + identifier=CatalogReferenceIdentifier( + id=custom_field.custom_field_id + ), + multivalue=False, + sources=[ + CatalogDeclarativeReferenceSource( + column=custom_field.custom_field_source_column, + target=CatalogGrainIdentifier( + id=custom_field.custom_field_id, + type=CustomFieldType.DATE.value, + ), + data_type=custom_field.custom_field_source_column_data_type.value, + ) + ], + ) + + @staticmethod + def _get_sources( + dataset: CustomDataset, + ) -> tuple[ + CatalogDataSourceTableIdentifier | None, + CatalogDeclarativeDatasetSql | None, + ]: + """Get the data source table and SQL from the dataset definition.""" + # We will have either a table id or a sql statement. Let's store + # whatever data is available to variables and pass it to the + # dataset. Both can be object instances or None, but at least one + # should be valid as per prior validation. + dataset_source_table_id = ( + CatalogDataSourceTableIdentifier( + id=dataset.definition.dataset_source_table, + data_source_id=dataset.definition.dataset_datasource_id, + path=[dataset.definition.dataset_source_table], + ) + if dataset.definition.dataset_source_table + else None + ) + + dataset_sql = ( + CatalogDeclarativeDatasetSql( + statement=dataset.definition.dataset_source_sql, + data_source_id=dataset.definition.dataset_datasource_id, + ) + if dataset.definition.dataset_source_sql + else None + ) + return dataset_source_table_id, dataset_sql + + def datasets_to_ldm( + self, datasets: dict[DatasetId, CustomDataset] + ) -> CatalogDeclarativeModel: + """Convert validated datasets to GoodData declarative model. + + Args: + datasets (dict[DatasetId, CustomDataset]): Dictionary of validated + datasets. + Returns: + CatalogDeclarativeModel: GoodData declarative model representation + of the datasets. 
+ """ + + declarative_datasets: list[CatalogDeclarativeDataset] = [] + + # Date dimensions are not stored in a dataset, but as a separate datasets + # in `date_instances` object on the LDM + date_instances: list[CatalogDeclarativeDateDataset] = [] + + for dataset in datasets.values(): + date_references: list[CatalogDeclarativeReference] = [] + attributes: list[CatalogDeclarativeAttribute] = [] + facts: list[CatalogDeclarativeFact] = [] + + # Iterate through the custom fields and create the appropriate objects + for custom_field in dataset.custom_fields: + if custom_field.custom_field_type == CustomFieldType.ATTRIBUTE: + attributes.append( + self._attribute_from_field( + dataset.definition.dataset_name, custom_field + ) + ) + + elif custom_field.custom_field_type == CustomFieldType.FACT: + facts.append( + self._fact_from_field( + dataset.definition.dataset_name, custom_field + ) + ) + + # Process date dimensions and store them to date_instances. Date + # dimensions are not stored in a dataset, but as a separate dataset. + # However, they need to be referenced in the dataset references to + # create the connection between the dataset and the date dimension + # in the GoodData Logical Data Model. + elif custom_field.custom_field_type == CustomFieldType.DATE: + # Add the date dimension to the date_instances + date_instances.append( + self._date_from_field( + dataset.definition.dataset_name, custom_field + ) + ) + + # Create a reference so that the date dimension is connected + # to the dataset in the GoodData Logical Data Model. + date_references.append( + self._date_ref_from_field(custom_field) + ) + + else: + raise ValueError( + f"Unsupported custom field type: {custom_field.custom_field_type}" + ) + + # Get the data source info + dataset_source_table_id, dataset_sql = self._get_sources(dataset) + + # Construct the declarative dataset object and append it to the list. + declarative_datasets.append( + CatalogDeclarativeDataset( + id=dataset.definition.dataset_id, + title=dataset.definition.dataset_name, + grain=[], + references=[ + CatalogDeclarativeReference( + identifier=CatalogReferenceIdentifier( + id=dataset.definition.parent_dataset_reference, + ), + multivalue=True, + sources=[ + CatalogDeclarativeReferenceSource( + column=dataset.definition.dataset_reference_source_column, + data_type=dataset.definition.dataset_reference_source_column_data_type.value, + target=CatalogGrainIdentifier( + id=dataset.definition.parent_dataset_reference_attribute_id, + type=CustomFieldType.ATTRIBUTE.value, + ), + ) + ], + ), + ] + + date_references, + description=None, + attributes=attributes, + facts=facts, + data_source_table_id=dataset_source_table_id, + sql=dataset_sql, + workspace_data_filter_columns=[ + CatalogDeclarativeWorkspaceDataFilterColumn( + name=dataset.definition.workspace_data_filter_column_name, + data_type=ColumnDataType.STRING.value, + ) + ], + workspace_data_filter_references=[ + CatalogDeclarativeWorkspaceDataFilterReferences( + filter_id=CatalogDatasetWorkspaceDataFilterIdentifier( + id=dataset.definition.workspace_data_filter_id + ), + filter_column=dataset.definition.workspace_data_filter_column_name, + filter_column_data_type=ColumnDataType.STRING.value, + ) + ], + tags=[dataset.definition.dataset_name], + ) + ) + + # Create the Logical Data Model from the datasets and the date instances. 
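+        # The date instances are standalone date datasets; each declarative
+        # dataset links to them through the date references built above.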
+ ldm = CatalogDeclarativeLdm( + datasets=declarative_datasets, date_instances=date_instances + ) + return CatalogDeclarativeModel(ldm=ldm) diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/input_validator.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/input_validator.py new file mode 100644 index 000000000..23bf23b78 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/input_validator.py @@ -0,0 +1,185 @@ +# (C) 2025 GoodData Corporation +"""Module for validating custom fields input data. + +This module is responsible for validating custom fields input data checking for +row level and aggregated constraints. +""" + +from collections import Counter +from typing import Any, TypeVar + +from pydantic import BaseModel + +from gooddata_pipelines.ldm_extension.models.aliases import ( + DatasetId, + WorkspaceId, +) +from gooddata_pipelines.ldm_extension.models.custom_data_object import ( + CustomDataset, + CustomDatasetDefinition, + CustomFieldDefinition, + CustomFieldType, +) + + +class LdmExtensionDataValidator: + ModelT = TypeVar("ModelT", bound=BaseModel) + + def validate( + self, + dataset_definitions: list[CustomDatasetDefinition], + field_definitions: list[CustomFieldDefinition], + ) -> dict[WorkspaceId, dict[DatasetId, CustomDataset]]: + """Validate dataset and field definitions. + + Validates the dataset definitions and field definitions by using Pydantic + models to check row level constraints, then aggregates the definitions + per workspace, while checking for integrity on aggregated level, i.e., + uniqueness of combinations of identifieres on workspace level. + + Args: + raw_dataset_definitions (list[dict[str, str]]): List of raw dataset definitions to validate. + raw_field_definitions (list[dict[str, str]]): List of raw field definitions to validate. + Returns: + dict[WorkspaceId, dict[DatasetId, CustomDataset]]: + Dictionary of validated dataset definitions per workspace, + where each dataset contains its custom fields: + ```python + { + "workspace_id_1": { + "dataset_id_1": CustomDataset(...), + "dataset_id_2": CustomDataset(...), + }, + ... + } + ``` + """ + + # First, validate the dataset definitions and aggregate them per workspace. + validated_data = self._validate_dataset_definitions(dataset_definitions) + + # Then validate the field definitions and connect them to the datasets + validated_data = self._validate_field_definitions( + validated_data, field_definitions + ) + + return validated_data + + def _validate_dataset_definitions( + self, + dataset_definitions: list[CustomDatasetDefinition], + ) -> dict[WorkspaceId, dict[DatasetId, CustomDataset]]: + self._check_dataset_combinations(dataset_definitions) + + validated_definitions: dict[ + WorkspaceId, dict[DatasetId, CustomDataset] + ] = {} + for definition in dataset_definitions: + validated_definitions.setdefault(definition.workspace_id, {})[ + definition.dataset_id + ] = CustomDataset(definition=definition, custom_fields=[]) + + return validated_definitions + + def _check_dataset_combinations( + self, dataset_definitions: list[CustomDatasetDefinition] + ) -> None: + """Check integrity of provided dataset definitions. + + Validation criteria: + - workspace_id + dataset_id must be unique across all dataset definitions. + + Args: + dataset_definitions (list[CustomDatasetDefinition]): List of dataset definitions to check. + Raises: + ValueError: If there are duplicate dataset definitions based on workspace_id and dataset_id. 
+ """ + workspace_dataset_combinations = [ + (definition.workspace_id, definition.dataset_id) + for definition in dataset_definitions + ] + if len(workspace_dataset_combinations) != len( + set(workspace_dataset_combinations) + ): + duplicates = self._get_duplicates(workspace_dataset_combinations) + raise ValueError( + "Duplicate dataset definitions found in the raw dataset " + + f"definitions (workspace_id, dataset_id): {duplicates}" + ) + + @staticmethod + def _get_duplicates(list_to_check: list[Any]) -> list[Any]: + """Get duplicates from a list. + + Args: + list_to_check (list[Any]): List of items to check for duplicates. + Returns: + list[Any]: List of duplicate items. + """ + counts = Counter(list_to_check) + return [item for item, count in counts.items() if count > 1] + + def _check_field_combinations( + self, field_definitions: list[CustomFieldDefinition] + ) -> None: + """Check integrity of provided field definitions. + + Validation criteria (per workspace): + - unique workspace_id + cf_id combinations (only for attribute and fact custom_field_type) + - there is no row with the same dataset_id and cf_id (only for date custom_field_type) + + Args: + field_definitions (list[CustomFieldDefinition]): List of field definitions to check. + Raises: + ValueError: If there are duplicate field definitions based on workspace_id and cf_id. + """ + workspace_field_combinations: set[tuple[str, str]] = set() + dataset_field_combinations: set[tuple[str, str]] = set() + + for field in field_definitions: + if field.custom_field_type in [ + CustomFieldType.ATTRIBUTE, + CustomFieldType.FACT, + ]: + combination = (field.workspace_id, field.custom_field_id) + if combination in workspace_field_combinations: + raise ValueError( + f"Duplicate custom field found for workspace {field.workspace_id} " + + f"with field ID {field.custom_field_id}" + ) + workspace_field_combinations.add(combination) + + elif field.custom_field_type == CustomFieldType.DATE: + combination = (field.dataset_id, field.custom_field_id) + if combination in dataset_field_combinations: + raise ValueError( + f"Duplicate custom field found for dataset {field.dataset_id} " + + f"with field ID {field.custom_field_id}" + ) + dataset_field_combinations.add(combination) + + def _validate_field_definitions( + self, + validated_definitions: dict[ + WorkspaceId, dict[DatasetId, CustomDataset] + ], + field_definitions: list[CustomFieldDefinition], + ) -> dict[WorkspaceId, dict[DatasetId, CustomDataset]]: + """Validates custom field definitions amd connects them to the datasets. + + Args: + validated_definitions (dict[WorkspaceId, dict[DatasetId, CustomDataset]]): + Dictionary of validated dataset definitions per workspace. + raw_field_definitions (list[dict[str, str]]): List of raw field definitions to validate. + Returns: + dict[WorkspaceId, dict[DatasetId, CustomDataset]]: + Updated dictionary of validated dataset definitions with custom fields added. 
+ """ + self._check_field_combinations(field_definitions) + + for field_definition in field_definitions: + validated_definitions[field_definition.workspace_id][ + field_definition.dataset_id + ].custom_fields.append(field_definition) + + return validated_definitions diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/ldm_extension_manager.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/ldm_extension_manager.py new file mode 100644 index 000000000..f08f017e2 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/ldm_extension_manager.py @@ -0,0 +1,283 @@ +# (C) 2025 GoodData Corporation +"""Module orchestrating the custom fields logic.""" + +from pathlib import Path + +from gooddata_sdk.sdk import GoodDataSdk +from gooddata_sdk.utils import PROFILES_FILE_PATH, profile_content + +from gooddata_pipelines.api import GoodDataApi +from gooddata_pipelines.ldm_extension.input_processor import ( + LdmExtensionDataProcessor, +) +from gooddata_pipelines.ldm_extension.input_validator import ( + LdmExtensionDataValidator, +) +from gooddata_pipelines.ldm_extension.models.aliases import ( + DatasetId, + WorkspaceId, +) +from gooddata_pipelines.ldm_extension.models.analytical_object import ( + AnalyticalObject, + AnalyticalObjects, +) +from gooddata_pipelines.ldm_extension.models.custom_data_object import ( + CustomDataset, + CustomDatasetDefinition, + CustomFieldDefinition, +) +from gooddata_pipelines.logger.logger import LogObserver + + +class LdmExtensionManager: + """Manager for creating custom datasets and fields in GoodData workspaces.""" + + INDENT = " " * 2 + + @classmethod + def create(cls, host: str, token: str) -> "LdmExtensionManager": + return cls(host=host, token=token) + + @classmethod + def create_from_profile( + cls, + profile: str = "default", + profiles_path: Path = PROFILES_FILE_PATH, + ) -> "LdmExtensionManager": + """Creates a provisioner instance using a GoodData profile file.""" + content = profile_content(profile, profiles_path) + return cls(host=content["host"], token=content["token"]) + + def __init__(self, host: str, token: str): + self._validator = LdmExtensionDataValidator() + self._processor = LdmExtensionDataProcessor() + self._sdk = GoodDataSdk.create(host_=host, token_=token) + self._api = GoodDataApi(host=host, token=token) + self.logger = LogObserver() + + def _get_objects_with_invalid_relations( + self, workspace_id: str + ) -> list[AnalyticalObject]: + """Check for invalid references in the provided analytical objects. + + This method checks if the references in the provided analytical objects + are valid. It returns a set of analytical objects that have invalid references. + + Args: + workspace_id (str): The ID of the workspace to check. + + Returns: + list[AnalyticalObject]: Set of analytical objects with invalid references. + """ + + analytical_objects: list[AnalyticalObject] = ( + self._get_analytical_objects(workspace_id=workspace_id) + ) + + objects_with_invalid_relations = [ + obj + for obj in analytical_objects + if not obj.attributes.are_relations_valid + ] + return objects_with_invalid_relations + + def _get_analytical_objects( + self, workspace_id: str + ) -> list[AnalyticalObject]: + """Get analytical objects in the workspace. + + This method retrieves all analytical objects (metrics, visualizations, dashboards) + in the specified workspace and returns them as a list. + + Args: + workspace_id (str): The ID of the workspace to retrieve objects from. 
+ + Returns: + list[AnalyticalObject]: List of analytical objects in the workspace. + """ + metrics_response = self._api.get_all_metrics(workspace_id) + visualizations_response = self._api.get_all_visualization_objects( + workspace_id + ) + dashboards_response = self._api.get_all_dashboards(workspace_id) + + self._api.raise_if_response_not_ok( + metrics_response, + visualizations_response, + dashboards_response, + ) + metrics = AnalyticalObjects(**metrics_response.json()) + visualizations = AnalyticalObjects(**visualizations_response.json()) + dashboards = AnalyticalObjects(**dashboards_response.json()) + + return metrics.data + visualizations.data + dashboards.data + + @staticmethod + def _new_ldm_does_not_invalidate_relations( + current_invalid_relations: list[AnalyticalObject], + new_invalid_relations: list[AnalyticalObject], + ) -> bool: + """Check if the new LDM does not invalidate any new relations. + + This method compares the lists of analytical objects containing invalid + relations. It creates sets of object IDs for each list and compares them. + + If the set of new invalid relations is a subset of the set of current + invalid relations (that is before the changes to the LDM), the new LDM + does not invalidate any new relations and `True` is returned. + + If the set of new invalid relations is not a subset of the current one, + it means that the new LDM invalidates new relations and `False` is returned. + + Args: + current_invalid_relations (list[AnalyticalObject]): The current (before + changes to LDM) invalid relations. + new_invalid_relations (list[AnalyticalObject]): The new (after changes to + LDM) invalid relations. + + Returns: + bool: True if the new LDM does not invalidate any relations, False otherwise. + """ + # Create a set of IDs for each group, then compare those sets + set_current_invalid_relations = { + obj.id for obj in current_invalid_relations + } + set_new_invalid_relations = {obj.id for obj in new_invalid_relations} + + # If the set of new invalid relations is a subset of the current one, + return set_new_invalid_relations.issubset(set_current_invalid_relations) + + def _process_with_relations_check( + self, + validated_data: dict[WorkspaceId, dict[DatasetId, CustomDataset]], + ) -> None: + """Check whether relations of analytical objects are valid before and after + updating the LDM in the GoodData workspace. + """ + # Iterate through the workspaces. + for workspace_id, datasets in validated_data.items(): + self.logger.info(f"⚙️ Processing workspace {workspace_id}...") + # Get current workspace layout + current_layout = ( + self._sdk.catalog_workspace.get_declarative_workspace( + workspace_id + ) + ) + # Get a set of objects with invalid relations from current workspace state + current_invalid_relations = ( + self._get_objects_with_invalid_relations( + workspace_id=workspace_id + ) + ) + + # Put the LDM with custom datasets into the GoodData workspace. + self._sdk.catalog_workspace_content.put_declarative_ldm( + workspace_id=workspace_id, + ldm=self._processor.datasets_to_ldm(datasets), + ) + + # Get a set of objects with invalid relations from the new workspace state + new_invalid_relations = self._get_objects_with_invalid_relations( + workspace_id=workspace_id + ) + + if self._new_ldm_does_not_invalidate_relations( + current_invalid_relations, new_invalid_relations + ): + self._log_success_message(workspace_id) + continue + + self.logger.error( + f"❌ Difference in invalid relations found in workspace {workspace_id}." 
+ ) + self._log_diff_invalid_relations( + current_invalid_relations, new_invalid_relations + ) + + self.logger.info( + f"{self.INDENT}⚠️ Reverting the workspace layout to the original state." + ) + # Put the original workspace layout back to the workspace + try: + self._sdk.catalog_workspace.put_declarative_workspace( + workspace_id=workspace_id, workspace=current_layout + ) + except Exception as e: + self.logger.error( + f"Failed to revert workspace layout in {workspace_id}: {e}" + ) + + def _log_diff_invalid_relations( + self, + current_invalid_relations: list[AnalyticalObject], + new_invalid_relations: list[AnalyticalObject], + ) -> None: + """Logs objects with newly invalid relations. + + Objects which previously did not have invalid relations, but do so after + updating the LDM, are logged. + """ + # TODO: test ! + diff_to_log: list[str] = [] + for obj in new_invalid_relations: + if obj not in current_invalid_relations: + diff_to_log.append( + f"{self.INDENT}∙ {obj.id} ({obj.type}) {obj.attributes.title}" + ) + joined_diff_to_log = "\n".join(diff_to_log) + error_message = f"{self.INDENT}Objects with newly invalidated relations:\n{joined_diff_to_log}" + + self.logger.error(error_message) + + def _process_without_relations_check( + self, + validated_data: dict[WorkspaceId, dict[DatasetId, CustomDataset]], + ) -> None: + """Update the LDM in the GoodData workspace without checking relations.""" + for workspace_id, datasets in validated_data.items(): + # Put the LDM with custom datasets into the GoodData workspace. + self._sdk.catalog_workspace_content.put_declarative_ldm( + workspace_id=workspace_id, + ldm=self._processor.datasets_to_ldm(datasets), + ) + self._log_success_message(workspace_id) + + def _log_success_message(self, workspace_id: str) -> None: + """Log a success message after updating the workspace LDM.""" + self.logger.info(f"✅ LDM in {workspace_id} updated successfully.") + + def process( + self, + custom_datasets: list[CustomDatasetDefinition], + custom_fields: list[CustomFieldDefinition], + check_relations: bool = True, + ) -> None: + """Create custom datasets and fields in GoodData workspaces. + + Creates custom datasets and fields to extend the Logical Data Model (LDM) + in GoodData workspaces based on the provided raw data definitions. The raw + data is validated by Pydantic models (CustomDatasetDefinition and CustomFieldDefinition). + The defined datasets and fields are then uploaded to GoodData Cloud. + + Args: + custom_datasets (list[CustomDatasetDefinition]): List of custom dataset definitions. + custom_fields (list[CustomFieldDefinition]): List of custom field definitions. + check_relations (bool): If True, checks for invalid relations in the workspace + after updating the LDM. If the number of invalid relations increases, + the LDM will be reverted to its previous state. If False, the check + is skiped and the LDM is updated directly. Defaults to True. + + Raises: + ValueError: If there are validation errors in the dataset or field definitions. + """ + # Validate raw data and aggregate the custom field and dataset + # definitions per workspace. + validated_data: dict[WorkspaceId, dict[DatasetId, CustomDataset]] = ( + self._validator.validate(custom_datasets, custom_fields) + ) + + if check_relations: + # Process the validated data with relations check. 
+ self._process_with_relations_check(validated_data) + else: + self._process_without_relations_check(validated_data) diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/__init__.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/__init__.py new file mode 100644 index 000000000..37d863d60 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/__init__.py @@ -0,0 +1 @@ +# (C) 2025 GoodData Corporation diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/aliases.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/aliases.py new file mode 100644 index 000000000..98e4ef793 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/aliases.py @@ -0,0 +1,9 @@ +# (C) 2025 GoodData Corporation +"""This module defines type aliases intended to improve readability.""" + +from typing import TypeAlias + +WorkspaceId: TypeAlias = str +DatasetId: TypeAlias = str + +__all__ = ["WorkspaceId", "DatasetId"] diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/analytical_object.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/analytical_object.py new file mode 100644 index 000000000..fede882a7 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/analytical_object.py @@ -0,0 +1,33 @@ +# (C) 2025 GoodData Corporation +"""This module defines the AnalyticalObjects Pydantic model. + +The model is used to represent features of analytical objects important for +checking the validity of references. +""" + +from pydantic import BaseModel, Field + + +class Attributes(BaseModel): + title: str + are_relations_valid: bool = Field(alias="areRelationsValid") + + +class AnalyticalObject(BaseModel): + id: str + type: str + attributes: Attributes + + +class AnalyticalObjects(BaseModel): + """Simplified model representing response obtained from GoodData API when querying + analytical objects. + + This model is used to represent analytical objects such as metrics, visualizations, + and dashboard in a simplified manner, with the purpose of checkinf the validity + of references of these objects. + + This is not a complete schema of the analytical objects! + """ + + data: list[AnalyticalObject] diff --git a/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/custom_data_object.py b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/custom_data_object.py new file mode 100644 index 000000000..b241d5e34 --- /dev/null +++ b/gooddata-pipelines/gooddata_pipelines/ldm_extension/models/custom_data_object.py @@ -0,0 +1,90 @@ +# (C) 2025 GoodData Corporation +"""This module defines enums and models used to represent the input data. + +Models defined here are used to validate and structure the input data before +further processing. 
+""" + +from enum import Enum + +from pydantic import BaseModel, model_validator + + +class CustomFieldType(str, Enum): + """GoodData field types.""" + + # NOTE: Start using StrEnum with Python 3.11 + ATTRIBUTE = "attribute" + FACT = "fact" + DATE = "date" + + +class ColumnDataType(str, Enum): + """Supported data types""" + + # NOTE: Start using StrEnum with Python 3.11 + INT = "INT" + STRING = "STRING" + DATE = "DATE" + NUMERIC = "NUMERIC" + TIMESTAMP = "TIMESTAMP" + TIMESTAMP_TZ = "TIMESTAMP_TZ" + BOOLEAN = "BOOLEAN" + + +class CustomFieldDefinition(BaseModel): + """Input model for custom field definition.""" + + workspace_id: str + dataset_id: str + custom_field_id: str + custom_field_name: str + custom_field_type: CustomFieldType + custom_field_source_column: str + custom_field_source_column_data_type: ColumnDataType + + @model_validator(mode="after") + def check_ids_not_equal(self) -> "CustomFieldDefinition": + """Check that custom field ID is not the same as dataset ID.""" + if self.custom_field_id == self.dataset_id: + raise ValueError( + f"Custom field ID {self.custom_field_id} cannot be the same as dataset ID {self.dataset_id}" + ) + return self + + +class CustomDatasetDefinition(BaseModel): + """Input model for custom dataset definition.""" + + workspace_id: str + dataset_id: str + dataset_name: str + dataset_datasource_id: str + dataset_source_table: str | None + dataset_source_sql: str | None + parent_dataset_reference: str + parent_dataset_reference_attribute_id: str + dataset_reference_source_column: str + dataset_reference_source_column_data_type: ColumnDataType + workspace_data_filter_id: str + workspace_data_filter_column_name: str + + @model_validator(mode="after") + def check_source(self) -> "CustomDatasetDefinition": + """At least one of dataset_source_table or dataset_source_sql is provided.""" + if not (self.dataset_source_table or self.dataset_source_sql): + raise ValueError( + "One of dataset_source_table and dataset_source_sql must be provided" + ) + if self.dataset_source_table and self.dataset_source_sql: + raise ValueError( + "Only one of dataset_source_table and dataset_source_sql can be provided" + ) + return self + + +class CustomDataset(BaseModel): + """Custom dataset with its definition and custom fields.""" + + definition: CustomDatasetDefinition + custom_fields: list[CustomFieldDefinition] diff --git a/gooddata-pipelines/tests/data/custom_fields/response_get_all_dashboards.json b/gooddata-pipelines/tests/data/custom_fields/response_get_all_dashboards.json new file mode 100644 index 000000000..0b8839860 --- /dev/null +++ b/gooddata-pipelines/tests/data/custom_fields/response_get_all_dashboards.json @@ -0,0 +1,72 @@ +{ + "data": [ + { + "id": "dashboard_id_1", + "type": "analyticalDashboard", + "attributes": { + "title": "Custom Dashboard", + "areRelationsValid": true, + "content": { + "layout": { + "type": "IDashboardLayout", + "sections": [ + { + "type": "IDashboardLayoutSection", + "items": [ + { + "type": "IDashboardLayoutItem", + "size": { + "xl": { + "gridWidth": 12, + "gridHeight": 12 + } + }, + "widget": { + "insight": { + "identifier": { + "id": "visualization_id_1", + "type": "visualizationObject" + } + } + } + }, + { + "type": "IDashboardLayoutItem", + "size": { + "xl": { + "gridWidth": 12, + "gridHeight": 12 + } + }, + "widget": { + "insight": { + "identifier": { + "id": "visualization_id_2", + "type": "visualizationObject" + } + } + } + } + ] + } + ] + } + }, + "createdAt": "2025-06-17 13:13" + }, + "links": { + "self": 
"https://link-to-self.com" + }, + "meta": { + "origin": { + "originType": "NATIVE", + "originId": "workspace_id_1" + } + } + } + ], + "links": { + "self": "https://link-to-self.com", + "next": "https://link-to-next.com" + } +} diff --git a/gooddata-pipelines/tests/data/custom_fields/response_get_all_metrics.json b/gooddata-pipelines/tests/data/custom_fields/response_get_all_metrics.json new file mode 100644 index 000000000..74407eba7 --- /dev/null +++ b/gooddata-pipelines/tests/data/custom_fields/response_get_all_metrics.json @@ -0,0 +1,78 @@ +{ + "data": [ + { + "id": "metric_id_1", + "type": "metric", + "attributes": { + "title": "metric title 1", + "description": "", + "areRelationsValid": true, + "createdAt": "2025-06-12 13:26", + "content": { + "format": "#,##0.00", + "maql": "select AVG({fact/some_fact})" + } + }, + "links": { + "self": "https://link-to-self.com" + }, + "meta": { + "origin": { + "originType": "NATIVE", + "originId": "workspace_id_1" + } + } + }, + { + "id": "metric_id_2", + "type": "metric", + "attributes": { + "title": "metric title 2", + "description": "", + "areRelationsValid": true, + "createdAt": "2025-06-13 08:12", + "content": { + "format": "#,##0.00", + "maql": "select AVG({fact/some_other_fact})" + } + }, + "links": { + "self": "https://link-to-self.com" + }, + "meta": { + "origin": { + "originType": "NATIVE", + "originId": "workspace_id_1" + } + } + }, + { + "id": "metric_id_3", + "type": "metric", + "attributes": { + "title": "metric title 3", + "description": "", + "areRelationsValid": false, + "createdAt": "2025-06-13 08:12", + "modifiedAt": "2025-06-13 08:16", + "content": { + "format": "#,##0.00", + "maql": "SELECT SUM( {fact/some_fact}* {fact/some_other_fact} )" + } + }, + "links": { + "self": "https://link-to-self.com" + }, + "meta": { + "origin": { + "originType": "NATIVE", + "originId": "custom_field_child" + } + } + } + ], + "links": { + "self": "https://link-to-self.com", + "next": "https://link-to-self.com" + } +} diff --git a/gooddata-pipelines/tests/data/custom_fields/response_get_all_visualizations.json b/gooddata-pipelines/tests/data/custom_fields/response_get_all_visualizations.json new file mode 100644 index 000000000..e64c66acc --- /dev/null +++ b/gooddata-pipelines/tests/data/custom_fields/response_get_all_visualizations.json @@ -0,0 +1,143 @@ +{ + "data": [ + { + "id": "visualization_id_1", + "type": "visualizationObject", + "attributes": { + "title": "chart title 1", + "description": "", + "areRelationsValid": true, + "content": { + "buckets": [ + { + "items": [ + { + "measure": { + "localIdentifier": "f3be64be0d3a49019088462bfe87d31f", + "definition": { + "measureDefinition": { + "item": { + "identifier": { + "id": "metric_id_1", + "type": "metric" + } + }, + "filters": [] + } + }, + "title": "item title 1" + } + } + ], + "localIdentifier": "measures" + }, + { + "items": [ + { + "attribute": { + "localIdentifier": "a5a36aa84014410aaaa2f16ade7d3808", + "displayForm": { + "identifier": { "id": "attribute id", "type": "label" } + } + } + } + ], + "localIdentifier": "attribute" + } + ], + "filters": [], + "sorts": [ + { + "attributeSortItem": { + "attributeIdentifier": "a5a36aa84014410aaaa2f16ade7d3808", + "direction": "asc" + } + } + ], + "properties": {}, + "visualizationUrl": "local:table", + "version": "2" + }, + "createdAt": "2025-06-12 13:28" + }, + "links": { + "self": "http://link-to-self.com" + }, + "meta": { + "origin": { "originType": "NATIVE", "originId": "workspace_id_1" } + } + }, + { + "id": "visualization_id_2", + "type": 
"visualizationObject", + "attributes": { + "title": "chart title 2", + "description": "", + "areRelationsValid": true, + "content": { + "buckets": [ + { + "items": [ + { + "measure": { + "localIdentifier": "91afbe18dca94984bc0ebb42b6b9f814", + "definition": { + "measureDefinition": { + "item": { + "identifier": { + "id": "metric_id_2", + "type": "metric" + } + }, + "filters": [] + } + }, + "title": "item title 2" + } + } + ], + "localIdentifier": "measures" + }, + { + "items": [ + { + "attribute": { + "localIdentifier": "5c5a83f5b5194fed9d4de1170acf3fef", + "displayForm": { + "identifier": { "id": "attribute id_1", "type": "label" } + } + } + }, + { + "attribute": { + "localIdentifier": "f80144be70944ea0be6b12d130f8ef0e", + "displayForm": { + "identifier": { "id": "attribute id_2", "type": "label" } + } + } + } + ], + "localIdentifier": "view" + } + ], + "filters": [], + "sorts": [], + "properties": {}, + "visualizationUrl": "local:column", + "version": "2" + }, + "createdAt": "2025-06-16 21:12" + }, + "links": { + "self": "https://link-to-self.com" + }, + "meta": { + "origin": { "originType": "NATIVE", "originId": "workspace_id_1" } + } + } + ], + "links": { + "self": "https://link-to-self.com", + "next": "https://link-to-next.com" + } +} diff --git a/gooddata-pipelines/tests/test_ldm_extension/__init__.py b/gooddata-pipelines/tests/test_ldm_extension/__init__.py new file mode 100644 index 000000000..37d863d60 --- /dev/null +++ b/gooddata-pipelines/tests/test_ldm_extension/__init__.py @@ -0,0 +1 @@ +# (C) 2025 GoodData Corporation diff --git a/gooddata-pipelines/tests/test_ldm_extension/test_input_processor.py b/gooddata-pipelines/tests/test_ldm_extension/test_input_processor.py new file mode 100644 index 000000000..851e903fa --- /dev/null +++ b/gooddata-pipelines/tests/test_ldm_extension/test_input_processor.py @@ -0,0 +1,174 @@ +# (C) 2025 GoodData Corporation +import pytest + +from gooddata_pipelines.ldm_extension.input_processor import ( + LdmExtensionDataProcessor, +) +from gooddata_pipelines.ldm_extension.models.custom_data_object import ( + ColumnDataType, + CustomDataset, + CustomDatasetDefinition, + CustomFieldDefinition, + CustomFieldType, +) + + +@pytest.fixture +def mock_custom_field_attribute(): + return CustomFieldDefinition( + workspace_id="workspace1", + dataset_id="ds1", + custom_field_id="attr1", + custom_field_name="Attribute 1", + custom_field_type=CustomFieldType.ATTRIBUTE, + custom_field_source_column="col_attr1", + custom_field_source_column_data_type=ColumnDataType.STRING, + ) + + +@pytest.fixture +def mock_custom_field_fact(): + return CustomFieldDefinition( + workspace_id="workspace1", + dataset_id="ds1", + custom_field_id="fact1", + custom_field_name="Fact 1", + custom_field_type=CustomFieldType.FACT, + custom_field_source_column="col_fact1", + custom_field_source_column_data_type=ColumnDataType.INT, + ) + + +@pytest.fixture +def mock_custom_field_date(): + return CustomFieldDefinition( + workspace_id="workspace1", + dataset_id="ds1", + custom_field_id="date1", + custom_field_name="Date 1", + custom_field_type=CustomFieldType.DATE, + custom_field_source_column="col_date1", + custom_field_source_column_data_type=ColumnDataType.DATE, + ) + + +@pytest.fixture +def mock_dataset_definition(): + return CustomDatasetDefinition( + workspace_id="workspace1", + dataset_id="ds1", + dataset_name="Dataset 1", + dataset_source_table="table1", + dataset_datasource_id="ds_source", + dataset_source_sql=None, + parent_dataset_reference="parent_ds", + 
parent_dataset_reference_attribute_id="parent_attr", + dataset_reference_source_column="ref_col", + dataset_reference_source_column_data_type=ColumnDataType.STRING, + workspace_data_filter_id="wdf1", + workspace_data_filter_column_name="col1", + ) + + +@pytest.fixture +def mock_custom_dataset( + mock_dataset_definition, + mock_custom_field_attribute, + mock_custom_field_fact, + mock_custom_field_date, +): + return CustomDataset( + definition=mock_dataset_definition, + custom_fields=[ + mock_custom_field_attribute, + mock_custom_field_fact, + mock_custom_field_date, + ], + ) + + +def test_attribute_from_field(mock_custom_field_attribute): + attr = LdmExtensionDataProcessor._attribute_from_field( + "dataset_name", mock_custom_field_attribute + ) + assert attr.id == "attr1" + assert attr.title == "Attribute 1" + assert attr.source_column == "col_attr1" + assert attr.source_column_data_type == ColumnDataType.STRING.value + assert attr.tags == ["dataset_name"] + + +def test_fact_from_field(mock_custom_field_fact): + fact = LdmExtensionDataProcessor._fact_from_field( + "dataset_name", mock_custom_field_fact + ) + assert fact.id == "fact1" + assert fact.title == "Fact 1" + assert fact.source_column == "col_fact1" + assert fact.source_column_data_type == ColumnDataType.INT.value + assert fact.tags == ["dataset_name"] + + +def test_date_from_field(mock_custom_field_date): + processor = LdmExtensionDataProcessor() + date_ds = processor._date_from_field("dataset_name", mock_custom_field_date) + assert date_ds.id == "date1" + assert date_ds.title == "Date 1" + assert set(date_ds.granularities) == set(processor.DATE_GRANULARITIES) + assert date_ds.tags == ["dataset_name"] + + +def test_date_ref_from_field(mock_custom_field_date): + ref = LdmExtensionDataProcessor._date_ref_from_field(mock_custom_field_date) + assert ref.identifier.id == "date1" + assert ref.sources + assert ref.sources[0].column == "col_date1" + assert ref.sources[0].data_type == ColumnDataType.DATE.value + + +def test_get_sources_table_only(mock_dataset_definition): + mock_dataset_definition.dataset_source_sql = None + dataset = CustomDataset( + definition=mock_dataset_definition, custom_fields=[] + ) + table_id, sql = LdmExtensionDataProcessor._get_sources(dataset) + assert table_id is not None + assert table_id.id == "table1" + assert sql is None + + +def test_get_sources_sql_only(mock_dataset_definition): + mock_dataset_definition.dataset_source_table = None + mock_dataset_definition.dataset_source_sql = "SELECT * FROM foo" + dataset = CustomDataset( + definition=mock_dataset_definition, custom_fields=[] + ) + table_id, sql = LdmExtensionDataProcessor._get_sources(dataset) + assert table_id is None + assert sql is not None + assert sql.statement == "SELECT * FROM foo" + + +def test_datasets_to_ldm(mock_custom_dataset): + print(mock_custom_dataset) + processor = LdmExtensionDataProcessor() + datasets = {"ds1": mock_custom_dataset} + model = processor.datasets_to_ldm(datasets) + # Check that the model contains the expected dataset and date instance + ldm = model.ldm + assert ldm + assert len(ldm.datasets) == 1 + ds = ldm.datasets[0] + assert ds.id == "ds1" + assert ds.title == "Dataset 1" + assert ds.attributes + assert ds.facts + assert len(ds.attributes) == 1 + assert len(ds.facts) == 1 + assert len(ds.references) == 2 # 1 parent + 1 date + assert ds.workspace_data_filter_columns + assert ds.workspace_data_filter_references + assert ds.workspace_data_filter_columns[0].name == "col1" + assert 
ds.workspace_data_filter_references[0].filter_id.id == "wdf1" + assert len(ldm.date_instances) == 1 + assert ldm.date_instances[0].id == "date1" diff --git a/gooddata-pipelines/tests/test_ldm_extension/test_input_validator.py b/gooddata-pipelines/tests/test_ldm_extension/test_input_validator.py new file mode 100644 index 000000000..13401574a --- /dev/null +++ b/gooddata-pipelines/tests/test_ldm_extension/test_input_validator.py @@ -0,0 +1,165 @@ +# (C) 2025 GoodData Corporation +import pytest + +from gooddata_pipelines.ldm_extension.input_validator import ( + LdmExtensionDataValidator, +) +from gooddata_pipelines.ldm_extension.models.custom_data_object import ( + ColumnDataType, + CustomDataset, + CustomDatasetDefinition, + CustomFieldDefinition, + CustomFieldType, +) + + +@pytest.fixture +def valid_dataset_definitions(): + """Fixture to provide valid dataset definitions for testing.""" + return [ + CustomDatasetDefinition( + workspace_id="ws1", + dataset_id="ds1", + dataset_name="Dataset 1", + dataset_datasource_id="ds_source_1", + dataset_source_table="table1", + dataset_source_sql=None, + parent_dataset_reference="parent1", + parent_dataset_reference_attribute_id="parent1.id", + dataset_reference_source_column="id", + dataset_reference_source_column_data_type=ColumnDataType.STRING, + workspace_data_filter_id="wdf1", + workspace_data_filter_column_name="id", + ), + CustomDatasetDefinition( + workspace_id="ws2", + dataset_id="ds1", + dataset_name="Dataset 2", + dataset_datasource_id="ds_source_2", + dataset_source_table="table2", + dataset_source_sql=None, + parent_dataset_reference="parent2", + parent_dataset_reference_attribute_id="parent2.id", + dataset_reference_source_column="id", + dataset_reference_source_column_data_type=ColumnDataType.INT, + workspace_data_filter_id="wdf2", + workspace_data_filter_column_name="id", + ), + ] + + +@pytest.fixture +def valid_field_definitions(): + """Fixture to provide valid field definitions for testing.""" + return [ + CustomFieldDefinition( + workspace_id="ws1", + dataset_id="ds1", + custom_field_id="cf1", + custom_field_name="Field 1", + custom_field_type=CustomFieldType.ATTRIBUTE, + custom_field_source_column="col1", + custom_field_source_column_data_type=ColumnDataType.STRING, + ), + CustomFieldDefinition( + workspace_id="ws1", + dataset_id="ds1", + custom_field_id="cf2", + custom_field_name="Field 2", + custom_field_type=CustomFieldType.ATTRIBUTE, + custom_field_source_column="col2", + custom_field_source_column_data_type=ColumnDataType.STRING, + ), + CustomFieldDefinition( + workspace_id="ws2", + dataset_id="ds1", + custom_field_id="cf3", + custom_field_name="Field 3", + custom_field_type=CustomFieldType.ATTRIBUTE, + custom_field_source_column="col3", + custom_field_source_column_data_type=ColumnDataType.STRING, + ), + ] + + +def test_validate_success(valid_dataset_definitions, valid_field_definitions): + """Provide valid input data and expect successful validation.""" + validator = LdmExtensionDataValidator() + result = validator.validate( + valid_dataset_definitions, valid_field_definitions + ) + assert isinstance(result, dict) + assert "ws1" in result + assert "ds1" in result["ws1"] + assert isinstance(result["ws1"]["ds1"], CustomDataset) + assert len(result["ws1"]["ds1"].custom_fields) == 2 + assert result["ws2"]["ds1"].custom_fields[0].custom_field_id == "cf3" + + +def test_duplicate_dataset_raises(valid_dataset_definitions): + """Test that duplicate dataset definitions raise a ValueError.""" + # Add a duplicate dataset definition + 
+    invalid = valid_dataset_definitions + valid_dataset_definitions
+    validator = LdmExtensionDataValidator()
+    with pytest.raises(ValueError, match="Duplicate dataset definitions"):
+        validator.validate(invalid, [])
+
+
+def test_duplicate_field_workspace_level(valid_dataset_definitions):
+    """Duplicate cf_id for an ATTRIBUTE in the same workspace
+    should raise a ValueError."""
+    fields = [
+        CustomFieldDefinition(
+            workspace_id="ws1",
+            dataset_id="ds1",
+            custom_field_id="cf1",
+            custom_field_type=CustomFieldType.ATTRIBUTE,
+            custom_field_name="Field 1",
+            custom_field_source_column="col1",
+            custom_field_source_column_data_type=ColumnDataType.STRING,
+        ),
+        CustomFieldDefinition(
+            workspace_id="ws1",
+            dataset_id="ds2",
+            custom_field_id="cf1",
+            custom_field_type=CustomFieldType.ATTRIBUTE,
+            custom_field_name="Field 2",
+            custom_field_source_column="col2",
+            custom_field_source_column_data_type=ColumnDataType.STRING,
+        ),
+    ]
+    validator = LdmExtensionDataValidator()
+    with pytest.raises(
+        ValueError,
+        match="Duplicate custom field found for workspace ws1 with field ID cf1",
+    ):
+        validator.validate(valid_dataset_definitions, fields)
+
+
+def test_duplicate_field_dataset_level(valid_dataset_definitions):
+    """Duplicate cf_id for a DATE in the same dataset
+    should raise a ValueError."""
+    fields = [
+        CustomFieldDefinition(
+            workspace_id="ws1",
+            dataset_id="ds1",
+            custom_field_id="cf1",
+            custom_field_type=CustomFieldType.DATE,
+            custom_field_name="Field 1",
+            custom_field_source_column="col1",
+            custom_field_source_column_data_type=ColumnDataType.DATE,
+        ),
+        CustomFieldDefinition(
+            workspace_id="ws1",
+            dataset_id="ds1",
+            custom_field_id="cf1",
+            custom_field_type=CustomFieldType.DATE,
+            custom_field_name="Field 2",
+            custom_field_source_column="col2",
+            custom_field_source_column_data_type=ColumnDataType.DATE,
+        ),
+    ]
+    validator = LdmExtensionDataValidator()
+    with pytest.raises(
+        ValueError,
+        match="Duplicate custom field found for dataset ds1 with field ID cf1",
+    ):
+        validator.validate(valid_dataset_definitions, fields)
diff --git a/gooddata-pipelines/tests/test_ldm_extension/test_ldm_extension_manager.py b/gooddata-pipelines/tests/test_ldm_extension/test_ldm_extension_manager.py
new file mode 100644
index 000000000..5fc6cc087
--- /dev/null
+++ b/gooddata-pipelines/tests/test_ldm_extension/test_ldm_extension_manager.py
@@ -0,0 +1,194 @@
+# (C) 2025 GoodData Corporation
+import pytest
+from pytest_mock import MockerFixture
+
+from gooddata_pipelines.ldm_extension.ldm_extension_manager import (
+    LdmExtensionManager,
+)
+from gooddata_pipelines.ldm_extension.models.analytical_object import (
+    AnalyticalObject,
+    Attributes,
+)
+
+
+@pytest.fixture
+def manager(mocker: MockerFixture, mock_logger):
+    custom_fields_manager = LdmExtensionManager(host="host", token="token")
+
+    mocker.patch.object(custom_fields_manager, "_validator")
+    mocker.patch.object(custom_fields_manager, "_processor")
+    mocker.patch.object(custom_fields_manager, "_sdk")
+    mocker.patch.object(custom_fields_manager, "_api")
+
+    custom_fields_manager.logger.subscribe(mock_logger)
+
+    return custom_fields_manager
+
+
+@pytest.fixture
+def validated_data(mocker: MockerFixture):
+    # Minimal valid structure for validated_data
+    return {"workspace_1": {"dataset_1": mocker.MagicMock()}}
+
+
+def make_analytical_object(
+    id, title="Title", type="type", are_relations_valid=True
+):
+    obj = AnalyticalObject(
+        id=id,
+        type=type,
+        attributes=Attributes(
+            title=title, areRelationsValid=are_relations_valid
+        ),
+    )
+    return obj
+
+
+def test_relations_check_success(
+    manager, validated_data, mocker: MockerFixture
+):
+    """Relation check passes, workspace layout not reverted."""
+    # Setup mocks
+    mocker.patch.object(
+        manager._sdk.catalog_workspace,
+        "get_declarative_workspace",
+        return_value=mocker.MagicMock(
+            json=mocker.MagicMock(return_value="layout_json")
+        ),
+    )
+    mocker.patch.object(
+        manager,
+        "_get_analytical_objects",
+        side_effect=[
+            [make_analytical_object("a", "A")],  # current
+            [make_analytical_object("a", "A")],  # new
+        ],
+    )
+    mocker.patch.object(
+        manager,
+        "_get_objects_with_invalid_relations",
+        side_effect=[
+            set(),  # current_invalid_relations
+            set(),  # new_invalid_relations
+        ],
+    )
+    mocker.patch.object(
+        manager._processor, "datasets_to_ldm", return_value="ldm"
+    )
+    mocker.patch.object(
+        manager._sdk.catalog_workspace_content, "put_declarative_ldm"
+    )
+    mocker.patch.object(
+        manager, "_new_ldm_does_not_invalidate_relations", return_value=True
+    )
+    mocker.patch.object(
+        manager._sdk.catalog_workspace, "put_declarative_workspace"
+    )
+
+    # Should print "Workspace workspace_1 LDM updated." and not revert
+    manager._process_with_relations_check(validated_data)
+    manager._sdk.catalog_workspace_content.put_declarative_ldm.assert_called_once()
+    manager._sdk.catalog_workspace.put_declarative_workspace.assert_not_called()
+
+
+def test_relations_check_failure_and_revert(
+    manager, validated_data, capsys, mocker: MockerFixture
+):
+    """Relation check fails, workspace layout is reverted."""
+    # Setup mocks
+    mocker.patch.object(manager._api, "get_workspace_layout")
+    obj1 = make_analytical_object("a", "A", "type", False)
+    obj2 = make_analytical_object("b", "B", "type", False)
+    mocker.patch.object(
+        manager,
+        "_get_objects_with_invalid_relations",
+        side_effect=[
+            [obj1],  # current_invalid_relations
+            [obj1, obj2],  # new_invalid_relations (one more invalid)
+        ],
+    )
+    mocker.patch.object(
+        manager._processor, "datasets_to_ldm", return_value="ldm"
+    )
+    mocker.patch.object(
+        manager._sdk.catalog_workspace_content, "put_declarative_ldm"
+    )
+    mocker.patch.object(
+        manager, "_new_ldm_does_not_invalidate_relations", return_value=False
+    )
+    mocker.patch.object(
+        manager._sdk.catalog_workspace, "put_declarative_workspace"
+    )
+
+    manager._process_with_relations_check(validated_data)
+
+    # Should revert and print info about invalid relations
+    manager._sdk.catalog_workspace.put_declarative_workspace.assert_called_once()
+    out = capsys.readouterr().out
+    assert (
+        "Difference in invalid relations found in workspace workspace_1."
+        in out
+    )
+    assert "b (type) B" in out
+    assert "Reverting the workspace layout to the original state." in out
+
+
+def test_relations_check_fewer_invalid_relations(
+    manager, validated_data, mocker: MockerFixture
+):
+    """Fewer invalid relations after LDM update, no revert needed."""
+    # Setup mocks
+    obj1 = make_analytical_object("a", "A", "type", False)
+    mocker.patch.object(
+        manager._sdk.catalog_workspace,
+        "get_declarative_workspace",
+        return_value=mocker.MagicMock(
+            json=mocker.MagicMock(return_value="layout_json")
+        ),
+    )
+    mocker.patch.object(
+        manager,
+        "_get_objects_with_invalid_relations",
+        side_effect=[
+            [
+                obj1,
+                make_analytical_object("b", "B", "type", False),
+            ],  # current_invalid_relations
+            [obj1],  # new_invalid_relations (fewer)
+        ],
+    )
+    mocker.patch.object(
+        manager._processor, "datasets_to_ldm", return_value="ldm"
+    )
+    mocker.patch.object(
+        manager._sdk.catalog_workspace_content, "put_declarative_ldm"
+    )
+    mocker.patch.object(
+        manager, "_new_ldm_does_not_invalidate_relations", return_value=True
+    )
+    mocker.patch.object(
+        manager._sdk.catalog_workspace, "put_declarative_workspace"
+    )
+
+    manager._process_with_relations_check(validated_data)
+    manager._sdk.catalog_workspace.put_declarative_workspace.assert_not_called()
+
+
+def test_log_diff_invalid_relations(manager, capsys):
+    """Only objects that are newly invalid should be logged."""
+    manager._log_diff_invalid_relations(
+        [
+            make_analytical_object("a", "A", "type", False),
+            make_analytical_object("c", "C", "type", False),
+        ],
+        [
+            make_analytical_object("b", "B", "type", False),
+            make_analytical_object("c", "C", "type", False),
+            make_analytical_object("d", "D", "type", False),
+        ],
+    )
+
+    captured_output = capsys.readouterr().out
+
+    assert "b (type) B" in captured_output
+    assert "d (type) D" in captured_output
+    assert "c (type) C" not in captured_output
diff --git a/gooddata-pipelines/tests/test_ldm_extension/test_models/__init__.py b/gooddata-pipelines/tests/test_ldm_extension/test_models/__init__.py
new file mode 100644
index 000000000..37d863d60
--- /dev/null
+++ b/gooddata-pipelines/tests/test_ldm_extension/test_models/__init__.py
@@ -0,0 +1 @@
+# (C) 2025 GoodData Corporation
diff --git a/gooddata-pipelines/tests/test_ldm_extension/test_models/test_analytical_object.py b/gooddata-pipelines/tests/test_ldm_extension/test_models/test_analytical_object.py
new file mode 100644
index 000000000..6665d8fa0
--- /dev/null
+++ b/gooddata-pipelines/tests/test_ldm_extension/test_models/test_analytical_object.py
@@ -0,0 +1,66 @@
+# (C) 2025 GoodData Corporation
+import json
+
+import pytest
+
+from gooddata_pipelines.ldm_extension.models.analytical_object import (
+    AnalyticalObject,
+    AnalyticalObjects,
+)
+from tests.conftest import TEST_DATA_DIR
+
+
+@pytest.mark.parametrize(
+    "file_path",
+    [
+        f"{TEST_DATA_DIR}/custom_fields/response_get_all_metrics.json",
+        f"{TEST_DATA_DIR}/custom_fields/response_get_all_visualizations.json",
+        f"{TEST_DATA_DIR}/custom_fields/response_get_all_dashboards.json",
+    ],
+)
+def test_analytical_object_model_with_valid_responses(file_path):
+    with open(file_path, "r") as file:
+        data = json.load(file)
+    analytical_objects = AnalyticalObjects(**data)
+    assert isinstance(analytical_objects, AnalyticalObjects)
+    assert isinstance(analytical_objects.data, list)
+    assert all(
+        isinstance(obj, AnalyticalObject) for obj in analytical_objects.data
+    )
+
+
+@pytest.mark.parametrize(
+    "response",
+    [
+        {
+            "something": "unexpected",
+        },
+        {
+            "data": [
+                {
+                    # "id": "metric1",  # Missing id field
+                    "type": "metric",
+                    "attributes": {
+                        "title": "Test Metric",
+                        "areRelationsValid": True,
+                    },
+                }
+            ]
+        },
+        {
+            "data": [
+                {
+                    "id": 123,  # invalid id type
+                    "type": "metric",
+                    "attributes": {
+                        "title": "Test Metric",
+                        "areRelationsValid": True,
+                    },
+                }
+            ]
+        },
+    ],
+)
+def test_analytical_object_model_with_invalid_response(response):
+    with pytest.raises(ValueError):
+        AnalyticalObjects(**response)
diff --git a/gooddata-pipelines/tests/test_ldm_extension/test_models/test_custom_data_object.py b/gooddata-pipelines/tests/test_ldm_extension/test_models/test_custom_data_object.py
new file mode 100644
index 000000000..f0c605b15
--- /dev/null
+++ b/gooddata-pipelines/tests/test_ldm_extension/test_models/test_custom_data_object.py
@@ -0,0 +1,102 @@
+# (C) 2025 GoodData Corporation
+import pytest
+from pydantic import ValidationError
+
+from gooddata_pipelines.ldm_extension.models.custom_data_object import (
+    ColumnDataType,
+    CustomDataset,
+    CustomDatasetDefinition,
+    CustomFieldDefinition,
+    CustomFieldType,
+)
+
+
+def make_valid_field_def(**kwargs):
+    data = {
+        "workspace_id": "ws1",
+        "dataset_id": "ds1",
+        "custom_field_id": "cf1",
+        "custom_field_name": "Custom Field",
+        "custom_field_type": CustomFieldType.ATTRIBUTE,
+        "custom_field_source_column": "col1",
+        "custom_field_source_column_data_type": ColumnDataType.STRING,
+    }
+    data.update(kwargs)
+    return data
+
+
+def make_valid_dataset_def(**kwargs):
+    data = {
+        "workspace_id": "ws1",
+        "dataset_id": "ds1",
+        "dataset_name": "Dataset",
+        "dataset_datasource_id": "dsrc1",
+        "dataset_source_table": "table1",
+        "dataset_source_sql": None,
+        "parent_dataset_reference": "parent_ds",
+        "parent_dataset_reference_attribute_id": "parent_attr",
+        "dataset_reference_source_column": "src_col",
+        "dataset_reference_source_column_data_type": ColumnDataType.STRING,
+        "workspace_data_filter_id": "wdf1",
+        "workspace_data_filter_column_name": "col1",
+    }
+    data.update(kwargs)
+    return data
+
+
+def test_custom_field_definition_valid():
+    field = CustomFieldDefinition(**make_valid_field_def())
+    assert field.custom_field_id == "cf1"
+    assert field.custom_field_type == CustomFieldType.ATTRIBUTE
+
+
+def test_custom_field_definition_cf_id_equals_dataset_id_raises():
+    data = make_valid_field_def(custom_field_id="ds1")
+    with pytest.raises(ValidationError) as exc:
+        CustomFieldDefinition(**data)
+    assert "cannot be the same as dataset ID" in str(exc.value)
+
+
+def test_custom_dataset_definition_valid_table():
+    ds = CustomDatasetDefinition(**make_valid_dataset_def())
+    assert ds.dataset_source_table == "table1"
+    assert ds.dataset_source_sql is None
+
+
+def test_custom_dataset_definition_valid_sql():
+    data = make_valid_dataset_def(
+        dataset_source_table=None, dataset_source_sql="SELECT 1"
+    )
+    ds = CustomDatasetDefinition(**data)
+    assert ds.dataset_source_sql == "SELECT 1"
+    assert ds.dataset_source_table is None
+
+
+def test_custom_dataset_definition_both_none_raises():
+    data = make_valid_dataset_def(
+        dataset_source_table=None, dataset_source_sql=None
+    )
+    with pytest.raises(ValidationError) as exc:
+        CustomDatasetDefinition(**data)
+    assert "must be provided" in str(exc.value)
+
+
+def test_custom_dataset_definition_both_provided_raises():
+    data = make_valid_dataset_def(
+        dataset_source_table="table1", dataset_source_sql="SELECT 1"
+    )
+    with pytest.raises(ValidationError) as exc:
+        CustomDatasetDefinition(**data)
+    assert (
+        "Only one of dataset_source_table and dataset_source_sql can be provided"
+        in str(exc.value)
+    )
+
+
+def test_custom_dataset_model():
+    ds_def = CustomDatasetDefinition(**make_valid_dataset_def())
+    field_def = CustomFieldDefinition(**make_valid_field_def())
+    dataset = CustomDataset(definition=ds_def, custom_fields=[field_def])
+    assert dataset.definition.dataset_id == "ds1"
+    assert len(dataset.custom_fields) == 1
+    assert dataset.custom_fields[0].custom_field_id == "cf1"