## Exploratory Analysis for lexical-cloud-data's products

We've already transformed the product data from lexical-cloud-data and stored it at `data/products.json`.

In [1]:
import pandas as pd

In [2]:
products_df = pd.read_json("data/products.json")
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 312 entries, 0 to 311
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   name        312 non-null    object
 1   providers   311 non-null    object
 2   services    311 non-null    object
 3   domains     312 non-null    object
 4   categories  309 non-null    object
 5   features    84 non-null     object
 6   links       312 non-null    object
 7   type        312 non-null    object
 8   labels      4 non-null      object
dtypes: object(9)
memory usage: 22.1+ KB


#### How are taxonomy entries represented by product type?

We know there are three types of product records: component, model, and product.

In [3]:
products_df = products_df.drop(columns=['links'])
products_df.groupby('type').count()

| type | name | providers | services | domains | categories | features | labels |
|---|---|---|---|---|---|---|---|
| component | 18 | 17 | 17 | 18 | 17 | 7 | 1 |
| model | 15 | 15 | 15 | 15 | 15 | 5 | 0 |
| product | 279 | 279 | 279 | 279 | 277 | 72 | 3 |


Notice that component records have fewer providers (17) than names (18), so one component has no provider. Model and product records have a provider for every name.
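
To see which records account for a gap like this, one option is to filter on missing values. The sketch below runs on a small made-up frame: `sample_df` and its rows are hypothetical stand-ins, not rows from `data/products.json`. On the real data the same filter would be applied to `products_df`.

```python
import pandas as pd

# Hypothetical stand-in for products_df; these rows are illustrative only.
sample_df = pd.DataFrame({
    "name": ["alpha", "beta", "gamma"],
    "providers": [["aws"], None, ["gcp"]],
    "type": ["product", "component", "model"],
})

# Keep only the rows whose providers value is missing.
missing_providers = sample_df[sample_df["providers"].isna()]

# Count the missing values per record type.
missing_by_type = missing_providers.groupby("type").size()

print(missing_providers[["name", "type"]])
```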

#### How is the taxonomy represented on actual products?

In [4]:
products_product_df = products_df.loc[products_df['type'] == "product"].explode('providers')
products_product_df.groupby('providers').count()

| providers | name | services | domains | categories | features | type | labels |
|---|---|---|---|---|---|---|---|
| aws | 118 | 118 | 118 | 117 | 22 | 118 | 2 |
| azure | 92 | 92 | 92 | 92 | 28 | 92 | 0 |
| gcp | 66 | 66 | 66 | 66 | 21 | 66 | 1 |
| github | 3 | 3 | 3 | 2 | 1 | 3 | 0 |
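
When reading these counts, keep in mind that `explode` emits one row per list element, so a record listing several providers is counted once under each of them. A minimal sketch of that behavior, using made-up rows (`toy_df` is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows: "alpha" lists two providers, "beta" lists one.
toy_df = pd.DataFrame({
    "name": ["alpha", "beta"],
    "providers": [["aws", "gcp"], ["azure"]],
})

# explode repeats the other columns once per element of the list.
exploded = toy_df.explode("providers")

# Each provider now gets its own row, so "alpha" is counted twice overall.
counts = exploded.groupby("providers")["name"].count()
print(counts)
```

In the output above, the per-provider `name` counts sum to 118 + 92 + 66 + 3 = 279, the same as the number of product records, which is consistent with each product listing a single provider.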


In [5]:
products_product_df.filter(['providers','services']).explode('services') \
    .groupby(['services','providers']).size().reset_index(name='count') \
    .pivot(index='services', columns='providers', values='count').fillna(0).astype(int)

| services / providers | aws | azure | gcp | github |
|---|---|---|---|---|
| ai | 4 | 7 | 7 | 1 |
| analytics | 7 | 5 | 5 | 0 |
| compute | 11 | 8 | 8 | 0 |
| database | 9 | 9 | 6 | 0 |
| developer tools | 13 | 9 | 2 | 3 |
| framework | 3 | 2 | 2 | 0 |
| governance | 16 | 13 | 6 | 0 |
| hybrid | 5 | 2 | 1 | 0 |
| identity | 3 | 2 | 2 | 0 |
| integration | 10 | 9 | 8 | 0 |
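
A service-by-provider count table like this one can also be built with `pd.crosstab`, which fills absent combinations with 0 directly. This is offered as an alternative sketch rather than what the notebook does; `toy_df` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: providers already exploded, services still stored as lists.
toy_df = pd.DataFrame({
    "providers": ["aws", "aws", "gcp"],
    "services": [["ai", "compute"], ["compute"], ["ai"]],
})

# One row per (provider, service) pair; reset the index so labels are unique.
exploded = toy_df.explode("services").reset_index(drop=True)

# Rows are services, columns are providers, cells are occurrence counts.
table = pd.crosstab(exploded["services"], exploded["providers"])
print(table)
```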
