This notebook builds a very simple generator executable that works with the tapas library.

In [45]:
import tapas.datasets
import tapas.generators
import numpy as np 
import pandas as pd

First, we need to create an instance of the `TabularDataset` class.

In [46]:
rng = np.random.default_rng(3)
n = 2000
wage = rng.random(n)
identifier = np.arange(n)

In [47]:
raw_data = np.stack([identifier, wage], axis=1)
df_data = pd.DataFrame(raw_data, columns=["id", "wage"])
df_data["id"] = df_data["id"].astype(int)
df_data.head()

| | id | wage |
|---|---|---|
| 0 | 0 | 0.085649 |
| 1 | 1 | 0.236811 |
| 2 | 2 | 0.801274 |
| 3 | 3 | 0.582162 |
| 4 | 4 | 0.094129 |


In [48]:
df_data["wage"].unique()

array([0.08564917, 0.23681051, 0.80127447, ..., 0.50198532, 0.17647698,
       0.43108475])

In [49]:
data_schema = [
    {"name": "id", "type": "finite", "representation": df_data["id"].unique()},
    {"name": "wage", "type": "finite", "representation": df_data["wage"].unique()}
]

In [50]:
data_description = tapas.datasets.DataDescription(schema=data_schema)

In [51]:
data = tapas.datasets.TabularDataset(data=df_data, description=data_description)

In [52]:
data.data.head()

| | id | wage |
|---|---|---|
| 0 | 0 | 0.085649 |
| 1 | 1 | 0.236811 |
| 2 | 2 | 0.801274 |
| 3 | 3 | 0.582162 |
| 4 | 4 | 0.094129 |


This is our "original" dataset. Next, we load a generator that takes the original dataset and should return synthetic data based on it.

## The generator

We write the generator in the file `src/myexecutable.py` and make it executable with `chmod +x src/myexecutable.py`.
Key features of the generator:
- the data it produces has the same structure as the original dataset
- it prints the generated dataset to the console, and tapas reads that console output back into Python
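
To make the two features above concrete, here is a minimal sketch of what a file like `src/myexecutable.py` could look like. It is not the actual file used in this notebook. The notebook only establishes that the generator prints its data to the console. Everything else is an assumption to check against the tapas documentation: that the requested number of rows arrives as the first command-line argument, and that the original dataset arrives as headerless CSV on standard input. The helper names `sample_rows` and `main` are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a generator executable (not the notebook's actual file)."""
import random
import sys


def sample_rows(rows, num_samples, seed=None):
    """Draw num_samples rows from rows, with replacement."""
    if not rows:
        return []
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(num_samples)]


def main(argv, stdin, stdout):
    # Assumption: argv[1] is the number of synthetic rows to produce.
    num_samples = int(argv[1])
    # Assumption: the original dataset arrives as CSV text, one row per line.
    rows = [line.rstrip("\n") for line in stdin if line.strip()]
    # Each output row is a copy of an input row, so the structure is preserved.
    for row in sample_rows(rows, num_samples):
        stdout.write(row + "\n")


# Only run when invoked with exactly one argument (the number of rows).
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv, sys.stdin, sys.stdout)
```

Resampling existing rows is about the simplest way to guarantee that the output has the same columns as the input. It gives no privacy protection, so it only serves to exercise the plumbing.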


Then we can instantiate the generator.

In [53]:
generator = tapas.generators.GeneratorFromExecutable(exe="src/myexecutable.py")
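
Because the file is launched as a program, it needs the execute bit mentioned above. If running `chmod` in a shell is inconvenient, the same permission can be set from Python with the standard library. This is a small sketch, and `ensure_executable` is a hypothetical helper name, not part of tapas.

```python
import os
import stat


def ensure_executable(path):
    """Add execute permission for user, group, and others (roughly `chmod +x`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Calling `ensure_executable("src/myexecutable.py")` before instantiating the generator would then have the same effect as the shell command on a POSIX system.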

In [54]:
generator.fit(data)

In [56]:
synthetic_data = generator.generate(3)


In [57]:
synthetic_data.data.head()

| | id | wage |
|---|---|---|
| 0 | 0.0 | 0.682352 |
| 1 | 1.0 | 0.053821 |
| 2 | 2.0 | 0.22036 |
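
The round trip behind `generate` (start a program, let it print rows, read the printed rows back into Python) can be illustrated with the standard library alone. This sketch is not tapas's implementation. The function `run_generator`, the stand-in one-line generator, and the convention of passing the row count as a command-line argument with the data on standard input are all assumptions made for illustration.

```python
import csv
import io
import subprocess
import sys


def run_generator(command, original_csv, num_samples):
    """Run command with num_samples appended, feed it CSV text, parse its stdout."""
    result = subprocess.run(
        [*command, str(num_samples)],
        input=original_csv,
        capture_output=True,
        text=True,
        check=True,
    )
    return list(csv.reader(io.StringIO(result.stdout)))


# Stand-in generator: a Python one-liner that echoes the first N input rows.
ECHO_FIRST_ROWS = (
    "import sys; n = int(sys.argv[1]); "
    "sys.stdout.writelines(sys.stdin.readlines()[:n])"
)

rows = run_generator(
    [sys.executable, "-c", ECHO_FIRST_ROWS],
    "0,0.085649\n1,0.236811\n2,0.801274\n",
    num_samples=2,
)
print(rows)
```

The parsed values come back as strings, so a wrapper has to convert them to the column types itself. That conversion step is one plausible reason the `id` column shows up as `0.0`, `1.0`, `2.0` in the synthetic data above.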
