# Transforming lexical-cloud-data

We'll consolidate all of the JSON from the lexical-cloud-data folders into two entities:
  1. Taxonomy
  2. Products
  
This function will help us locate those files:

In [34]:
import glob
import os

def list_filenames(rel_path: str):
    return glob.glob(os.path.join(os.getcwd(),"data/lexical-cloud-data/%s"%rel_path,"*.json"))
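
A minimal sketch of how the glob pattern behaves, run against a throwaway directory. `list_filenames_in` is a hypothetical variant of `list_filenames` that takes an explicit base directory and sorts its results; the file names are made up for illustration. It suggests that only `*.json` files directly inside the target folder are matched, and that a wildcard in `rel_path` reaches into nested folders:

```python
import glob
import os
import tempfile

def list_filenames_in(base: str, rel_path: str):
    # Same pattern as list_filenames, but rooted at an explicit base directory
    # (hypothetical variant; sorted only to make the result deterministic)
    return sorted(glob.glob(os.path.join(base, "data/lexical-cloud-data/%s" % rel_path, "*.json")))

with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "data", "lexical-cloud-data", "categories")
    os.makedirs(os.path.join(folder, "access-control"))
    # Two top-level entries plus one nested relations.json (illustrative names)
    for name in ("access-control.json", "paas.json", os.path.join("access-control", "relations.json")):
        with open(os.path.join(folder, name), "w") as f:
            f.write("{}")
    # Matches only the JSON files directly inside categories/
    found = [os.path.basename(p) for p in list_filenames_in(base, "categories")]
    # A wildcard in rel_path descends one level into each entry folder
    nested = [os.path.basename(p) for p in list_filenames_in(base, "categories/*")]

print(found)
print(nested)
```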

## Taxonomy

JSON files for the taxonomy exist in these folders:
```
providers/*.json
services/*.json
domains/*.json
categories/*.json
features/*.json
labels/*.json
```
Each JSON entry has a matching folder containing the following files:
```
{taxonomy}/{entry}/products.json
{taxonomy}/{entry}/relations.json
```

Entries in each taxonomy folder have a `type` attribute matching the singular form of the folder name.
The nested folder's products.json duplicates information found elsewhere, so we won't use it.
The relations.json file holds the relationships from the taxonomy entry to products and to other taxonomy entries, so it will be merged into the taxonomy entity as a `relations` attribute. This attribute is more complete than `definition`, which we'll drop.


In [24]:
import json
import re

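def create_taxonomy_record(filename: str):
    # Load the taxonomy entry itself, e.g. categories/access-control.json
    result = None
    with open(filename, 'r') as f:
        result = json.load(f)
    if result is not None:
        # Drop `definition`; the `relations` attribute added below is more complete
        if "definition" in result:
            del result["definition"]
        # Merge the matching {entry}/relations.json into a `relations` attribute
        with open(re.sub(r"\.json$","/relations.json",filename), 'r') as f:
            result["relations"] = json.load(f)
    return result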

In [25]:
providers = [create_taxonomy_record(f) for f in list_filenames("providers")]
services = [create_taxonomy_record(f) for f in list_filenames("services")]
domains = [create_taxonomy_record(f) for f in list_filenames("domains")]
categories = [create_taxonomy_record(f) for f in list_filenames("categories")]
features = [create_taxonomy_record(f) for f in list_filenames("features")]

Resulting in taxonomy records like:

In [27]:
print(json.dumps(categories[0], indent=4))

{
    "id": "access-control",
    "name": "access control",
    "type": "category",
    "links": {
        "self": "/categories/access-control"
    },
    "relations": {
        "instance": [
            "/products/identity/aws/iam",
            "/products/governance/aws/service-control-policies",
            "/products/identity/azure/rbac",
            "/products/identity/gcp/iam"
        ],
        "intersect": {
            "domains": [
                "identity management"
            ],
            "services": [
                "identity"
            ]
        },
        "symdiff": {
            "categories": [
                "identity provider"
            ],
            "domains": [
                "systems management"
            ],
            "providers": [
                "aws",
                "azure",
                "gcp"
            ],
            "services": [
                "governance"
            ]
        }
    }
}
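
The per-folder lists can then be folded into a single taxonomy collection. The sketch below is one possible way to do that, not necessarily the approach this notebook ends up with. The records are stand-ins shaped like the output above (the provider record in particular is assumed), and `taxonomy`, `counts_by_type` and `taxonomy_index` are hypothetical names:

```python
from collections import Counter
from itertools import chain

# Stand-in records shaped like the output above; in the notebook these
# would be the lists built with create_taxonomy_record.
providers = [{"id": "aws", "name": "aws", "type": "provider"}]
categories = [
    {"id": "access-control", "name": "access control", "type": "category"},
    {"id": "paas", "name": "paas", "type": "category"},
]

# Flatten the per-folder lists into one taxonomy collection
taxonomy = list(chain(providers, categories))

# Sanity check: how many entries of each type ended up in the collection
counts_by_type = Counter(entry["type"] for entry in taxonomy)

# Index entries by their (type, id) pair for later lookups
taxonomy_index = {(entry["type"], entry["id"]): entry for entry in taxonomy}

print(dict(counts_by_type))
```

Keying the index on `(type, id)` rather than `id` alone avoids collisions if two taxonomy folders happen to reuse an id.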


## Products

The product entity combines the JSON entries found under the products directory:

```
products/{service}/{provider}/*.json
products/{service}/{provider}/*/models/*.json
products/{service}/{provider}/*/components/*.json
```

Adding a `type` attribute allows us to distinguish these later. For model and component entries, we also append the model or component name to the entry's `name`, which keeps all three kinds of record consistent.

In [None]:
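def create_product_record(filename: str, tier: str):
    result = None
    with open(filename, 'r') as f:
        result = json.load(f)
    if result is not None:
        # Tag the record so products, components and models can be told apart later
        result["type"] = tier
        # Component and model entries keep their own name under the `component`
        # or `model` key; append it to `name` and drop the key
        if tier in result and tier in ["component","model"]:
            result["name"] += " - %s"%result[tier]
            del result[tier]
    return result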

In [14]:
products = [create_product_record(f,"product") for f in list_filenames("products/*/*")]
product_components = [create_product_record(f,"component") for f in list_filenames("products/*/*/*/components")]
product_models = [create_product_record(f,"model") for f in list_filenames("products/*/*/*/models")]

Resulting in similar records for products, components and models:

In [28]:
print(json.dumps(products[0], indent=4))

{
    "name": "Amaxon Lex",
    "providers": [
        "aws"
    ],
    "services": [
        "ai"
    ],
    "domains": [
        "machine learning",
        "serverless"
    ],
    "categories": [
        "language processing"
    ],
    "features": [
        "conversational interface"
    ],
    "links": {
        "self": "/products/ai/aws/lex"
    },
    "type": "product"
}


In [29]:
print(json.dumps(product_components[0], indent=4))

{
    "name": "Google Cloud Vertex AI - AutoML",
    "providers": [
        "gcp"
    ],
    "services": [
        "ai"
    ],
    "domains": [
        "machine learning",
        "managed service"
    ],
    "categories": [
        "no-code",
        "model training"
    ],
    "links": {
        "self": "/products/ai/gcp/vertex-ai/components/automl",
        "parent": "/products/ai/gcp/vertex-ai"
    },
    "type": "component"
}


In [30]:
print(json.dumps(product_models[0], indent=4))

{
    "name": "Google App Engine - Flexible environment",
    "providers": [
        "gcp"
    ],
    "services": [
        "compute"
    ],
    "domains": [
        "managed service"
    ],
    "categories": [
        "paas"
    ],
    "features": [
        "container-based"
    ],
    "links": {
        "self": "/products/compute/gcp/app-engine/models/flexible",
        "parent": "/products/compute/gcp/app-engine"
    },
    "type": "model"
}
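
Because component and model records carry a `parent` link, the three lists can be combined into one product collection and still be traced back to their parent product. The following is a rough sketch under assumptions: the records are trimmed stand-ins based on the output above (the parent product record is assumed), and `all_products`, `by_link` and `children` are hypothetical names:

```python
from collections import defaultdict

# Trimmed stand-in records; in the notebook these would be the lists
# built with create_product_record.
products = [
    {"name": "Google App Engine", "type": "product",
     "links": {"self": "/products/compute/gcp/app-engine"}},
]
product_components = []
product_models = [
    {"name": "Google App Engine - Flexible environment", "type": "model",
     "links": {"self": "/products/compute/gcp/app-engine/models/flexible",
               "parent": "/products/compute/gcp/app-engine"}},
]

# One collection for every product-like record
all_products = products + product_components + product_models

# Index each record by its `self` link
by_link = {record["links"]["self"]: record for record in all_products}

# Group component and model links under their parent product link
children = defaultdict(list)
for record in all_products:
    parent = record["links"].get("parent")
    if parent is not None:
        children[parent].append(record["links"]["self"])

print(dict(children))
```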


At this point, every file has been parsed by `json.load` without error, so we know that all of our data is valid JSON. We can export this data:

In [36]:
# TODO - dump json to disk
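
One possible way to fill in the TODO above, offered as a sketch and not a settled design. `export_records` is a hypothetical helper and the output file name is an assumption; the round-trip check writes to a temporary folder so it does not touch the real data directory:

```python
import json
import os
import tempfile

def export_records(records: list, path: str):
    # Create the target folder if needed, then write all records as one JSON array
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)

# Round-trip check against a throwaway folder (stand-in record)
sample = [{"id": "access-control", "type": "category"}]
with tempfile.TemporaryDirectory() as out_dir:
    target = os.path.join(out_dir, "transformed", "taxonomy.json")
    export_records(sample, target)
    with open(target, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == sample)
```

In the notebook, the same helper could be called once for the combined taxonomy lists and once for the combined product lists, each with its own output path.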

## Analysis

TODO