* Overview: The purpose of this notebook is to analyze the sample data downloaded from the Subsalt portal, to understand its structure and features, and to record the findings along the way.

In [1]:
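# Load the JSON file into a pandas DataFrame
import json
import warnings

import pandas as pd

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Use a context manager so the file handle is closed after reading
with open("../data/sample_data_1113.json") as f:
    data = json.load(f)
df = pd.DataFrame(data)
df.head()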

Unnamed: 0,metadata,schema,genconfig,privacy,quality
0,"{'runtime_end': '2024-11-12T08:47:19.855989', ...","[{'name': 'agency_abbr', 'type': 'string', 'sy...","{'type': 'SubsaltTVAE', 'epochs': '100', 'lear...","[{'name': 'Minimum row count', 'threshold': 30...",[]
1,"{'runtime_end': '2024-11-12T08:47:24.279947', ...","[{'name': 'year', 'type': 'integer', 'synthesi...","{'type': 'SubsaltTVAE', 'epochs': '100', 'lear...",[{'name': 'Check distance distributions betwee...,[]
2,"{'runtime_end': '2024-11-12T08:47:35.348316', ...","[{'name': 'agency_abbr', 'type': 'string', 'sy...","{'type': 'SubsaltTVAE', 'epochs': '100', 'lear...","[{'name': 'Minimum row count', 'threshold': 30...",[]
3,"{'runtime_end': '2024-11-12T08:47:48.605414', ...","[{'name': 'year', 'type': 'integer', 'synthesi...","{'type': 'SubsaltCopulaGAN', 'epochs': '100', ...","[{'name': 'Minimum row count', 'threshold': 30...",[]
4,"{'runtime_end': '2024-11-12T08:47:57.323314', ...","[{'name': 'agency_abbr', 'type': 'string', 'sy...","{'type': 'SubsaltCopulaGAN', 'epochs': '100', ...","[{'name': 'Minimum row count', 'threshold': 30...",[]


- ✅  We can see four populated categories of data in the dataset: `metadata`, `schema`, `genconfig`, and `privacy`. The fifth column, `quality`, is empty in the rows shown, which is expected given our focus.
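
`df.head()` only shows the first five rows, so a quick way to confirm that `quality` is empty everywhere is to count the items in each list-valued cell. The sketch below uses a small toy frame (`toy`, an illustrative stand-in, not the real data) to show the idea.

```python
import pandas as pd

# Toy stand-in for the loaded records; the real `df` comes from the JSON file.
toy = pd.DataFrame(
    {
        "privacy": [[{"name": "Row memorization"}], [{"name": "Membership inference"}]],
        "quality": [[], []],
    }
)

# Number of items in each list-valued cell, per column.
list_lengths = toy.apply(lambda col: col.map(len))

# True only if every row has an empty `quality` list.
quality_all_empty = bool((list_lengths["quality"] == 0).all())
print(list_lengths)
print("quality empty in every row:", quality_all_empty)
```

On the real data, the same check would be applied to `df[["privacy", "quality"]]`, assuming both columns hold lists in every row.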

In [2]:
# expand column of dictionaries into separate columns
df = pd.concat(
    [df.drop(["genconfig"], axis=1), df["genconfig"].apply(pd.Series)], axis=1
)
df.shape

(62, 8)

In [3]:
df.head()

Unnamed: 0,metadata,schema,privacy,quality,type,epochs,learning_rate,batch_size
0,"{'runtime_end': '2024-11-12T08:47:19.855989', ...","[{'name': 'agency_abbr', 'type': 'string', 'sy...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltTVAE,100,0.0002,50000
1,"{'runtime_end': '2024-11-12T08:47:24.279947', ...","[{'name': 'year', 'type': 'integer', 'synthesi...",[{'name': 'Check distance distributions betwee...,[],SubsaltTVAE,100,0.0002,50000
2,"{'runtime_end': '2024-11-12T08:47:35.348316', ...","[{'name': 'agency_abbr', 'type': 'string', 'sy...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltTVAE,100,0.0002,50000
3,"{'runtime_end': '2024-11-12T08:47:48.605414', ...","[{'name': 'year', 'type': 'integer', 'synthesi...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltCopulaGAN,100,0.0002,50000
4,"{'runtime_end': '2024-11-12T08:47:57.323314', ...","[{'name': 'agency_abbr', 'type': 'string', 'sy...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltCopulaGAN,100,0.0002,50000


- ✅  We can see `type`, `epochs`, `learning_rate`, and `batch_size` in the `genconfig` data. These are the hyperparameters used to train the model.
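
The raw output above shows `'epochs': '100'`, so at least that hyperparameter is stored as a string. A minimal sketch of flattening the dictionaries with `pd.json_normalize` and converting the values to numbers follows. The toy records are illustrative, and treating `learning_rate` and `batch_size` as strings is an assumption, since their raw values are truncated in the output.

```python
import pandas as pd

# Toy genconfig records. Only the string-valued `epochs` is taken from the
# output above; the string types of the other two fields are assumptions.
genconfig = pd.Series(
    [
        {"type": "SubsaltTVAE", "epochs": "100", "learning_rate": "0.0002", "batch_size": "50000"},
        {"type": "SubsaltCopulaGAN", "epochs": "100", "learning_rate": "0.0002", "batch_size": "50000"},
    ]
)

# json_normalize flattens a list of dicts into one column per key.
params = pd.json_normalize(genconfig.tolist())

# Convert the hyperparameter columns to numeric dtypes; values that cannot
# be parsed become NaN instead of raising.
numeric_cols = ["epochs", "learning_rate", "batch_size"]
params[numeric_cols] = params[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(params.dtypes)
```

Numeric dtypes make later steps such as `describe()` or grouping by hyperparameter value behave as expected.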

In [4]:
df = pd.concat([df.drop(["metadata"], axis=1), df["metadata"].apply(pd.Series)], axis=1)
df.head()

Unnamed: 0,schema,privacy,quality,type,epochs,learning_rate,batch_size,runtime_end,row_count,product_version,table_names
0,"[{'name': 'agency_abbr', 'type': 'string', 'sy...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltTVAE,100,0.0002,50000,2024-11-12T08:47:19.855989,25000,v0.26.11,applications
1,"[{'name': 'year', 'type': 'integer', 'synthesi...",[{'name': 'Check distance distributions betwee...,[],SubsaltTVAE,100,0.0002,50000,2024-11-12T08:47:24.279947,5000,v0.26.11,credit_card_sample
2,"[{'name': 'agency_abbr', 'type': 'string', 'sy...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltTVAE,100,0.0002,50000,2024-11-12T08:47:35.348316,25000,v0.26.11,applications
3,"[{'name': 'year', 'type': 'integer', 'synthesi...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltCopulaGAN,100,0.0002,50000,2024-11-12T08:47:48.605414,5000,v0.26.11,credit_card_sample
4,"[{'name': 'agency_abbr', 'type': 'string', 'sy...","[{'name': 'Minimum row count', 'threshold': 30...",[],SubsaltCopulaGAN,100,0.0002,50000,2024-11-12T08:47:57.323314,25000,v0.26.11,applications


- ✅  We can see `runtime_end`, `row_count`, `product_version`, and `table_names` in the `metadata` data. These fields describe each run and the table it was generated from.

In [5]:
# Concatenate every per-run schema list in df["schema"] into a single DataFrame
df_schema = pd.concat([pd.DataFrame(schema) for schema in df["schema"]])
df_schema.sample(10)

Unnamed: 0,name,type,synthesize_as,indirect_identifier,direct_identifier,null_ratio,unique_values,min_value,max_value
9,hud_median_family_income,float,int,True,False,0.00492,239.0,37900.0,114200.0
1,month,integer,int,True,False,0.0,12.0,1.0,12.0
19,student_performance_StudentPerformanceFactors_...,integer,int,True,False,,45.0,55.0,101.0
1,month,integer,int,True,False,0.0,12.0,1.0,12.0
14,test_db_healthcare_dataset_Weight,float,float,False,False,,751.0,45.0,120.0
12,test_db_healthcare_dataset_RespiratoryRate,integer,categorical,False,False,,9.0,12.0,20.0
8,RespiratoryRate,integer,int,False,False,0.0,9.0,12.0,20.0
7,health,string,categorical,False,False,0.0,3.0,,
6,test_db_healthcare_dataset_HeartRate,integer,int,False,False,,41.0,60.0,100.0
10,test_db_healthcare_dataset_OxygenSaturation,integer,int,False,False,,11.0,90.0,100.0


- ✅  We can see `name`, `type`, `synthesize_as`, `indirect_identifier`, `direct_identifier`, `null_ratio`, `unique_values`, `min_value`, and `max_value` in the `schema` data. This is the per-column schema information of the dataset.
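
A plain `pd.concat` of the per-run schemas keeps each frame's own row labels, so labels repeat and the link back to the source table is lost. The sketch below shows one way to keep that link, assuming `table_names` holds a single table name per run as in the output above. The toy frame `runs` and the added columns `run` and `table` are illustrative names.

```python
import pandas as pd

# Toy stand-in for `df`: one row per run, with a table name and a list of
# column specifications.
runs = pd.DataFrame(
    {
        "table_names": ["applications", "credit_card_sample"],
        "schema": [
            [{"name": "agency_abbr", "type": "string"}, {"name": "year", "type": "integer"}],
            [{"name": "year", "type": "integer"}],
        ],
    }
)

# Build one frame per run, tag it with the run position and table name,
# then stack the frames with a fresh index so that row labels are unique.
frames = [
    pd.DataFrame(schema).assign(run=run_id, table=table)
    for run_id, (table, schema) in enumerate(zip(runs["table_names"], runs["schema"]))
]
schema_long = pd.concat(frames, ignore_index=True)
print(schema_long)
```

With the `table` column in place, schema statistics can be grouped per source table instead of being pooled across all runs.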

In [6]:
df_schema.describe()

Unnamed: 0,null_ratio,unique_values,min_value,max_value
count,318.0,809.0,488.0,488.0
mean,0.000402,2092.792336,1138.381455,36585.3
std,0.001298,18792.968992,5934.824897,142700.1
min,0.0,2.0,-100.0,1.0
25%,0.0,4.0,1.0,12.0
50%,0.0,12.0,4.0,100.0
75%,0.0,131.0,60.0,850.0
max,0.00492,218133.0,37900.0,1490400.0


- ✅  We don't see -1 in the `null_ratio` and `unique_values` columns in the `schema` data: their minimum values are 0.0 and 2.0, respectively.
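
Reading the `min` row of `describe()` works, but the absence of the -1 value can also be checked directly by counting matching cells. A minimal sketch on a toy frame (`toy_schema`, illustrative values only):

```python
import numpy as np
import pandas as pd

# Toy schema rows; the NaN mimics columns where `null_ratio` is missing.
toy_schema = pd.DataFrame(
    {
        "null_ratio": [0.0, 0.00492, np.nan],
        "unique_values": [12.0, 239.0, 3.0],
    }
)

# Count cells equal to -1 in each column. NaN never compares equal to -1,
# so missing values are not counted.
sentinel_counts = (toy_schema[["null_ratio", "unique_values"]] == -1).sum()
print(sentinel_counts)
```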

In [7]:
df_schema.info()

<class 'pandas.core.frame.DataFrame'>
Index: 871 entries, 0 to 22
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 871 non-null    object 
 1   type                 871 non-null    object 
 2   synthesize_as        809 non-null    object 
 3   indirect_identifier  871 non-null    bool   
 4   direct_identifier    871 non-null    bool   
 5   null_ratio           318 non-null    float64
 6   unique_values        809 non-null    float64
 7   min_value            488 non-null    float64
 8   max_value            488 non-null    float64
dtypes: bool(2), float64(4), object(3)
memory usage: 56.1+ KB


In [8]:
df_schema["synthesize_as"].value_counts()

synthesize_as
categorical    363
int            285
float           92
binary          51
datetime        18
Name: count, dtype: int64

In [9]:
# check the NaN values ratio
df_schema.isnull().sum() / len(df_schema)

name                   0.000000
type                   0.000000
synthesize_as          0.071183
indirect_identifier    0.000000
direct_identifier      0.000000
null_ratio             0.634902
unique_values          0.071183
min_value              0.439724
max_value              0.439724
dtype: float64

> ⚠️ The ratio of NaN values in the `null_ratio` column of the `schema` data is high: about 63% of the rows are missing a value.

> The ratio of NaN values in the `min_value` and `max_value` columns is also high, at about 44%. This might be related to the large share of non-numeric columns in the dataset.
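
One way to test the guess about non-numeric columns is to break the missing-value ratio down by `synthesize_as`. The sketch below does this on a toy frame; `toy_schema` and the helper columns `min_missing` and `max_missing` are illustrative, and the values are not taken from the real data.

```python
import numpy as np
import pandas as pd

# Toy schema rows: categorical and binary columns have no min/max here.
toy_schema = pd.DataFrame(
    {
        "synthesize_as": ["categorical", "categorical", "int", "float", "binary"],
        "min_value": [np.nan, np.nan, 1.0, 45.0, np.nan],
        "max_value": [np.nan, np.nan, 12.0, 120.0, np.nan],
    }
)

# Share of rows with a missing min/max within each `synthesize_as` group.
# dropna=False keeps rows whose `synthesize_as` is itself NaN as a group.
missing_by_kind = (
    toy_schema.assign(
        min_missing=toy_schema["min_value"].isna(),
        max_missing=toy_schema["max_value"].isna(),
    )
    .groupby("synthesize_as", dropna=False)[["min_missing", "max_missing"]]
    .mean()
)
print(missing_by_kind)
```

If the missing values on the real data concentrate in the `categorical` and `binary` groups, that would support the explanation above; if numeric groups also show a high ratio, another cause is likely.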

In [10]:
# Concatenate every per-run privacy result list in df["privacy"] into a single DataFrame
df_privacy = pd.concat([pd.DataFrame(results) for results in df["privacy"]])
df_privacy

Unnamed: 0,name,threshold,score,passed
0,Minimum row count,3000.00,25000.000000,True
1,Check distance distributions between real and ...,0.05,0.195684,True
2,Membership inference,0.55,0.467075,True
3,Risky row counts,237.50,1.000000,True
4,Row memorization,0.05,0.000000,True
...,...,...,...,...
2,Row memorization,0.05,0.000000,True
3,Membership inference,0.55,0.502009,True
4,Risky row counts,1775.76,110.000000,True
5,No new categorical values,0.00,0.000000,True


In [11]:
df_privacy["name"].unique()

array(['Minimum row count',
       'Check distance distributions between real and synthetic',
       'Membership inference', 'Risky row counts', 'Row memorization',
       'Attribute inference', 'No new categorical values'], dtype=object)

In [12]:
df_privacy[df_privacy["passed"] == False].value_counts()

Series([], Name: count, dtype: int64)

- ❓ No privacy test failed in the dataset. Is this expected? 

* We now look more closely at three tests: `Risky row counts`, `Check distance distributions between real and synthetic`, and `Row memorization`. These tests failed in the first sample dataset.

In [13]:
df_privacy[df_privacy["name"] == "Row memorization"].value_counts()

name              threshold  score     passed
Row memorization  0.05       0.000000  True      59
                             0.001221  True       1
                             0.001347  True       1
                             0.011958  True       1
Name: count, dtype: int64

In [14]:
df_privacy[
    df_privacy["name"] == "Check distance distributions between real and synthetic"
].value_counts()

name                                                     threshold  score     passed
Check distance distributions between real and synthetic  0.05       0.068371  True      1
                                                                    0.616561  True      1
                                                                    0.482407  True      1
                                                                    0.492755  True      1
                                                                    0.500905  True      1
                                                                                       ..
                                                                    0.425455  True      1
                                                                    0.436950  True      1
                                                                    0.448084  True      1
                                                                    0.465079  True      1
               

In [15]:
df_privacy[df_privacy["name"] == "Risky row counts"].value_counts()

name              threshold  score  passed
Risky row counts  47.50      0.0    True      12
                  237.50     0.0    True       8
                  62.76      0.0    True       6
                  4001.79    0.0    True       6
                  123.12     0.0    True       6
                  4001.78    0.0    True       3
                  237.50     2.0    True       3
                  4001.62    0.0    True       3
                  237.50     6.0    True       2
                  142.49     0.0    True       2
                  463.99     1.0    True       1
                  1775.76    110.0  True       1
                             17.0   True       1
                  463.99     3.0    True       1
                  142.49     2.0    True       1
                  463.99     0.0    True       1
                  237.50     28.0   True       1
                             4.0    True       1
                             3.0    True       1
                          

- ❓ The privacy test scores are no longer all zero. Should we use the score as an indicator of the privacy test result? And what is the logic that connects the score, the threshold, and the pass/fail result?
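
The outputs above suggest that the pass direction differs between tests: `Minimum row count` passes with a score far above its threshold, while `Row memorization` passes with scores below it. A per-test summary of the gap between score and threshold can help frame the question. The sketch below is exploratory, not the portal's actual pass logic; the toy rows reuse values shown in the outputs above, and the `gap` column and `summary` layout are illustrative.

```python
import pandas as pd

# Toy rows; names, thresholds and scores are copied from the outputs above.
toy_privacy = pd.DataFrame(
    {
        "name": [
            "Minimum row count",
            "Row memorization",
            "Row memorization",
            "Membership inference",
        ],
        "threshold": [3000.0, 0.05, 0.05, 0.55],
        "score": [25000.0, 0.0, 0.011958, 0.467075],
        "passed": [True, True, True, True],
    }
)

# Signed gap between score and threshold. Among passing rows, its sign hints
# at whether a test passes above or below its threshold.
toy_privacy["gap"] = toy_privacy["score"] - toy_privacy["threshold"]

summary = toy_privacy.groupby("name").agg(
    runs=("passed", "size"),
    pass_rate=("passed", "mean"),
    min_gap=("gap", "min"),
    max_gap=("gap", "max"),
)
print(summary)
```

A test whose passing rows all have a negative gap probably uses the threshold as an upper bound, and a positive gap suggests a lower bound. This stays a hypothesis until it is confirmed against the Subsalt documentation.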

In [16]:
df_schema["indirect_identifier"].value_counts()

indirect_identifier
True     481
False    390
Name: count, dtype: int64

- ✅ We are able to see `indirect_identifier` with both `True` and `False` values.