## Exploratory Analysis for lexical-cloud-data

The data from lexical-cloud-data has already been transformed and consolidated into two files:
  1. data/products.json
  1. data/taxonomy.json

In [6]:
import pandas as pd

In [7]:
products_df = pd.read_json("data/products.json")
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        277 non-null    object
 1   providers   277 non-null    object
 2   services    277 non-null    object
 3   domains     277 non-null    object
 4   categories  277 non-null    object
 5   features    66 non-null     object
 6   links       277 non-null    object
 7   type        277 non-null    object
 8   labels      3 non-null      object
dtypes: object(9)
memory usage: 19.6+ KB


#### How are taxonomy entries represented by product type?

The `type` column distinguishes three kinds of records: component, model, and product.

In [8]:
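# 'links' is not needed for the counts below, so drop it
products_df = products_df.drop(columns=['links'])
# count() reports non-null values per column for each record type
products_df.groupby('type').count()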

| type | name | providers | services | domains | categories | features | labels |
|---|---|---|---|---|---|---|---|
| component | 3 | 3 | 3 | 3 | 3 | 1 | 0 |
| model | 13 | 13 | 13 | 13 | 13 | 3 | 0 |
| product | 261 | 261 | 261 | 261 | 261 | 62 | 3 |


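Notice that every type has as many providers, services, domains, and categories as names, so those columns are fully populated. Only features and labels are sparse: features is set on 1 of 3 components, 3 of 13 models, and 62 of 261 products, and labels appears on just 3 products.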
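
Raw counts are hard to compare across types of very different sizes, so a fill rate per type can make the sparsity easier to read. The following is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in for `products_df`, not the real data), assuming the optional columns are null when absent:

```python
import pandas as pd

# Toy stand-in for products_df; all values here are invented for illustration.
toy_df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "type": ["product", "product", "model", "component"],
    "providers": [["aws"], ["gcp"], ["aws"], ["azure"]],
    "features": [["x"], None, None, None],
    "labels": [None, None, None, None],
})

# Fraction of records per type that have a non-null value in each optional column.
fill_rate = (
    toy_df[["features", "labels"]]
    .notna()
    .groupby(toy_df["type"])
    .mean()
)
print(fill_rate)
```

Applied to the real frame, the same expression would presumably show features at roughly a quarter of products, in line with the counts above.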

#### How is the taxonomy represented on actual products?

In [9]:
# Keep only 'product' records; explode() yields one row per (record, provider) pair
products_product_df = products_df.loc[products_df['type'] == "product"].explode('providers')
products_product_df.groupby('providers').count()

| providers | name | services | domains | categories | features | type | labels |
|---|---|---|---|---|---|---|---|
| aws | 108 | 108 | 108 | 108 | 19 | 108 | 2 |
| azure | 89 | 89 | 89 | 89 | 25 | 89 | 0 |
| gcp | 62 | 62 | 62 | 62 | 17 | 62 | 1 |
| github | 2 | 2 | 2 | 2 | 1 | 2 | 0 |
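
Because `providers` is a list-valued column, the grouping relies on `explode` to turn each list element into its own row before counting. A small sketch of that behaviour, using an invented two-row frame (the names `p1` and `p2` are hypothetical):

```python
import pandas as pd

# Invented records: p1 lists two providers, p2 lists one.
toy_df = pd.DataFrame({
    "name": ["p1", "p2"],
    "providers": [["aws", "azure"], ["gcp"]],
})

# explode() repeats the other columns once per list element.
exploded = toy_df.explode("providers")

# Each provider now gets its own group, so p1 is counted under both aws and azure.
counts = exploded.groupby("providers")["name"].count()
print(counts)
```

A record with several providers is therefore counted once per provider, so per-provider totals can exceed the number of distinct records.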


In [10]:
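# Count products per (service, provider) pair and lay the counts out as a table.
# pivot() takes keyword arguments in pandas 2.x; missing pairs become 0.
products_product_df.filter(['providers','services']).explode('services') \
    .groupby(['services','providers']).size().reset_index(name='count') \
    .pivot(index='services', columns='providers', values='count') \
    .fillna(0).astype(int)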

| services | aws | azure | gcp | github |
|---|---|---|---|---|
| ai | 4 | 7 | 7 | 0 |
| analytics | 3 | 4 | 3 | 0 |
| compute | 11 | 8 | 8 | 0 |
| database | 8 | 9 | 5 | 0 |
| developer tools | 12 | 9 | 2 | 2 |
| framework | 3 | 2 | 2 | 0 |
| governance | 14 | 12 | 5 | 0 |
| hybrid | 5 | 2 | 1 | 0 |
| identity | 3 | 2 | 2 | 0 |
| integration | 6 | 8 | 7 | 0 |
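
The same kind of service-by-provider table can also be built with `pd.crosstab`, which counts co-occurrences directly and fills missing pairs with 0. This is offered as an alternative sketch, not the approach used in the notebook, and it runs on invented data:

```python
import pandas as pd

# Invented rows: one provider per record, a list of services per record.
toy_df = pd.DataFrame({
    "providers": ["aws", "aws", "gcp"],
    "services": [["compute", "ai"], ["compute"], ["ai"]],
})

# One row per (record, service) pair.
flat = toy_df.explode("services")

# Rows are services, columns are providers, cells are integer counts.
table = pd.crosstab(flat["services"], flat["providers"])
print(table)
```

This avoids the separate `groupby`, `reset_index`, `pivot`, and `fillna` steps, at the cost of being less explicit about each stage.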
