**Fine-tuning a gretel_synthetics ACTGAN model to generate synthetic data from a 311 call center dataset**

This notebook shows how to fine-tune one of Gretel's open source synthetic data models so that the synthetic data it produces retains the statistical properties of the original data.

In this example, we use [ACTGAN](https://docs.gretel.ai/create-synthetic-data/models/synthetics/gretel-actgan), a generative adversarial network (GAN) best suited to tabular, structured data that is largely numerical and has a high column count.

Steps
1. Load the training data
2. Create and configure the model
3. Train the model on the training data
4. Sample synthetic data from the model
5. Evaluate the quality of the synthetic output

In [7]:
# Import pandas for loading and working with DataFrames
import pandas as pd
# Import ACTGAN from gretel_synthetics
from gretel_synthetics.actgan import ACTGAN

In [8]:
# Load the training dataset as a Pandas DataFrame
train_df = pd.read_csv("./data/311_call_center_10k.csv")
train_df.head()

Unnamed: 0,CaseID,CreationTimestamp,Department,Category,Type,Detail,StreetAddress,Neighborhood,County,ZipCode,Latitude,Longitude,Status,ClosedDate,ExceededEstimatedTimeframe
0,C2019207923,2019-12-22T19:56:00Z,Public Works,Streets / Roadways / Alleys,Crack,District 1,10329 N Forest Ave,New Mark,Clay,64155.0,39.28196,-94.564453,RESOL,2020-06-26,Y
1,C2020054721,2020-04-18T17:10:00Z,Parks and Rec,Parks & Recreation,Park Maintenance,Central,400 W 31st St,Westside South,Jackson,64108.0,39.074934,-94.591904,RESOL,2020-04-30,N
2,C2019182182,2019-10-21T10:29:00Z,NHS,Property / Buildings / Construction,Dangerous Building,Standard,4043 Kenwood Ave,South Hyde Park,Jackson,64110.0,39.052983,-94.577808,RESOL,2020-08-03,Y
3,C2019184705,2019-10-25T10:02:00Z,NHS,Trash / Recycling,Recycling,Missed by City,637 E 62nd St,Western 49-63,Jackson,64110.0,39.01416,-94.579673,RESOL,2019-10-28,N
4,C2019184590,2019-10-25T04:44:00Z,Parks and Rec,Trees,Trimming,Tree Limbs,10901 Blue Ridge Blvd,Ruskin Heights,Jackson,64134.0,38.924085,-94.507192,RESOL,2019-12-04,Y


In [16]:
# Create the model and specify configuration
NUM_EPOCHS = 100
model = ACTGAN(
    verbose=True,
    binary_encoder_cutoff=10, # switch from one-hot to binary encoding once a column's cardinality exceeds this value
    auto_transform_datetimes=True,
    epochs=NUM_EPOCHS,
)
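Binary encoding matters for high-cardinality columns such as `CaseID` because it needs far fewer transformed columns than one-hot encoding. The sketch below is plain Python, not the library's implementation: `encoding_widths` is a hypothetical helper, the cardinalities are made-up examples, and the exact bit count used by the real encoder may differ slightly.

```python
from typing import Tuple


def encoding_widths(cardinality: int) -> Tuple[int, int]:
    """Return (one_hot_width, binary_width) for a discrete column.

    One-hot encoding uses one column per distinct value, while binary
    encoding uses roughly log2(cardinality) columns.
    """
    if cardinality < 1:
        raise ValueError("cardinality must be at least 1")
    one_hot_width = cardinality
    # Number of bits needed to represent the indices 0..cardinality-1
    binary_width = max(1, (cardinality - 1).bit_length())
    return one_hot_width, binary_width


# Hypothetical cardinalities, chosen only to illustrate the difference
for label, cardinality in [("low", 5), ("medium", 240), ("high", 10_000)]:
    one_hot, binary = encoding_widths(cardinality)
    print(f"{label}: one-hot={one_hot} columns, binary={binary} columns")
```

The gap widens quickly as cardinality grows, which is why a cutoff is useful: small categorical columns keep a simple one-hot representation, while identifier-like columns stay compact.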

In [17]:
# Train the model on the training dataset
model.fit(train_df)

INFO:gretel_synthetics.actgan.actgan_wrapper:Attempting datetime auto-detection...
INFO:gretel_synthetics.actgan.actgan_wrapper:Using field types: {'CreationTimestamp': {'type': 'datetime', 'format': '%Y-%m-%dT%XZ'}, 'ClosedDate': {'type': 'datetime', 'format': '%Y-%m-%d'}}
INFO:gretel_synthetics.actgan.actgan_wrapper:Using field transformers: {'CreationTimestamp': UnixTimestampEncoder(missing_value_replacement='mean', model_missing_values=True, datetime_format='%Y-%m-%dT%XZ'), 'ClosedDate': UnixTimestampEncoder(missing_value_replacement='mean', model_missing_values=True, datetime_format='%Y-%m-%d')}
INFO:gretel_synthetics.actgan.data_transformer:Starting data transforms on 16 columns...
INFO:gretel_synthetics.actgan.data_transformer:Transforming discrete column: 'CaseID' with BinaryEncodingTransformer
INFO:gretel_synthetics.actgan.data_transformer:Transforming continuous column: 'CreationTimestamp.value' with ClusterBasedNormalizer
INFO:gretel_synthetics.actgan.data_transformer:Transf
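
The log above shows the datetime columns being detected and handed to a `UnixTimestampEncoder`, so the model sees them as continuous numbers. As a rough illustration of that idea (standard library only, not the gretel_synthetics code), the sketch below converts a `CreationTimestamp` value to Unix seconds and back. The helper names are hypothetical, `%H:%M:%S` is spelled out in place of the locale-dependent `%X`, and treating the trailing `Z` as UTC is an assumption.

```python
from datetime import datetime, timezone

# Explicit form of the detected format '%Y-%m-%dT%XZ' (assumes %X means %H:%M:%S)
CREATION_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_unix_seconds(value: str, fmt: str = CREATION_FORMAT) -> float:
    """Parse a datetime string and return seconds since the Unix epoch (UTC assumed)."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def from_unix_seconds(seconds: float, fmt: str = CREATION_FORMAT) -> str:
    """Convert Unix seconds back to a string in the original format."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(fmt)


ts = to_unix_seconds("2019-12-22T19:56:00Z")
print(ts, from_unix_seconds(ts))
```

Because generated values are decoded back into the original string format, the synthetic `CreationTimestamp` and `ClosedDate` columns keep the same layout as the training data.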

In [19]:
# Sample synthetic data from the model
syn_df = model.sample(100)
syn_df.head()

Unnamed: 0,CaseID,CreationTimestamp,Department,Category,Type,Detail,StreetAddress,Neighborhood,County,ZipCode,Latitude,Longitude,Status,ClosedDate,ExceededEstimatedTimeframe
0,C2020073117,2021-05-21T10:45:58Z,NHS,Trash / Recycling,Trash Collection,Request New,3101 Paseo,Fairlane,Jackson,64127.0,39.20892,-94.598519,RESOL,2020-04-29,N
1,C2020002312,2019-06-17T16:42:30Z,KCPD,Water,Trash Collection,Missed by City,610 W 101st Ter,Clayton,Jackson,64124.0,38.925546,-94.439828,RESOL,2020-06-08,N
2,C2020133303,2020-03-09T16:59:19Z,,Streets / Roadways / Alleys,Pothole,At Curb / In Yard,6115 Tracy Ave,Winnetonka,Jackson,64130.0,38.922491,-94.48279,RESOL,2020-07-31,Y
3,C2019198177,2020-09-22T12:32:56Z,Water Services,Water,Snow / Ice,Street,11406 Crystal Ave,Eastwood Hills East,Jackson,64139.0,39.025139,-94.54721,RESOL,2020-07-22,N
4,C2019198151,2021-05-19T09:22:09Z,NHS,Streets / Roadways / Alleys,Trimming,At Curb / In Yard,5644 Swope Pkwy,Coleman Highlands,Jackson,64136.0,39.100338,-94.410814,RESOL,2019-11-22,Y


In [12]:
# Save the generated data to a CSV file
syn_df.to_csv("./out/syn.csv")

In [20]:
# Import the statistics utilities from gretel_synthetics
import gretel_synthetics.utils.stats as stats

In [21]:
# Compute the pairwise correlation matrix for the training data
stats.calculate_correlation(train_df)

Unnamed: 0,CaseID,CreationTimestamp,Department,Category,Type,Detail,StreetAddress,Neighborhood,County,ZipCode,Latitude,Longitude,Status,ClosedDate,ExceededEstimatedTimeframe
CaseID,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
CreationTimestamp,0.986343,1.0,0.931533,0.94516,0.959104,0.96521,0.981192,0.969627,0.925361,0.0,0.0,0.0,0.939484,0.972126,0.900225
Department,0.30103,0.166838,1.0,0.521593,0.394685,0.346908,0.171564,0.049228,0.075243,0.076702,0.107644,0.038409,0.124336,0.078333,0.049449
Category,0.349485,0.252231,0.777192,1.0,0.63238,0.523928,0.257632,0.071117,0.048494,0.095357,0.643081,0.706398,0.147338,0.119508,0.131261
Type,0.528486,0.391117,0.898661,0.966332,1.0,0.763542,0.394882,0.1714,0.09649,0.183,0.648327,0.709003,0.297012,0.240249,0.270973
Detail,0.60206,0.469074,0.941322,0.954112,0.909939,1.0,0.47405,0.285913,0.372252,0.377028,0.596588,0.592263,0.362987,0.310548,0.322495
StreetAddress,0.986773,0.980168,0.956925,0.964392,0.967326,0.974431,1.0,0.999946,0.999595,0.0,0.0,0.0,0.954756,0.970061,0.91419
Neighborhood,0.595504,0.55462,0.157219,0.15243,0.240414,0.336516,0.57256,1.0,0.98665,0.959815,0.540046,0.398112,0.167347,0.30927,0.043208
County,0.194538,0.070056,0.031806,0.013757,0.017913,0.05799,0.075755,0.130588,1.0,0.505028,0.738571,0.723066,0.017286,0.018008,0.0055
ZipCode,0.422549,0.367798,0.076702,0.095357,0.183,0.377028,0.384681,0.959815,0.505028,1.0,0.107992,-0.001081,0.078981,0.251836,0.004194


In [22]:
# Compute the same correlation matrix for the synthetic data, for comparison
stats.calculate_correlation(syn_df)

Unnamed: 0,CaseID,CreationTimestamp,Department,Category,Type,Detail,StreetAddress,Neighborhood,County,ZipCode,Latitude,Longitude,Status,ClosedDate,ExceededEstimatedTimeframe
CaseID,1.0,0.986564,0.962763,0.972006,0.984045,0.978462,0.981662,0.978317,0.959668,0.0,0.0,0.0,0.930167,0.981276,0.938556
CreationTimestamp,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Department,0.317048,0.477121,1.0,0.30406,0.295062,0.276762,0.325248,0.271286,0.219408,0.343591,0.34657,0.320054,0.190984,0.322742,0.232419
Category,0.425785,0.573064,0.40446,1.0,0.3561,0.422343,0.42754,0.369373,0.370391,0.0,0.0,0.0,0.161231,0.421714,0.296217
Type,0.567249,0.680864,0.516495,0.468607,1.0,0.562294,0.568571,0.525293,0.379952,0.0,0.0,0.0,0.654321,0.566896,0.353608
Detail,0.696367,0.784101,0.598128,0.68618,0.694223,1.0,0.694239,0.678207,0.499213,0.0,0.0,0.0,0.461676,0.700276,0.642532
StreetAddress,0.984672,0.988862,0.990691,0.979005,0.989364,0.978462,1.0,0.985544,0.959668,0.0,0.0,0.0,1.0,0.984396,0.959037
Neighborhood,0.829904,0.885426,0.698827,0.715307,0.773023,0.808383,0.833481,1.0,0.669405,0.0,0.0,0.0,0.903809,0.82685,0.626845
County,0.145889,0.238561,0.101286,0.128541,0.100201,0.106633,0.145443,0.119962,1.0,0.244214,0.366709,0.210865,0.074643,0.14109,0.022924
ZipCode,0.739892,0.795532,0.343591,0.526111,0.621925,0.661995,0.737631,0.701969,0.244214,1.0,0.035133,0.234284,0.002586,0.7321,0.016943
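
One way to reduce the two correlation matrices above to a single number is the mean absolute difference between matching cells. The following is a minimal sketch, not part of gretel_synthetics: `mean_abs_correlation_diff` is a hypothetical helper, and the two small matrices are toy values made up purely for illustration.

```python
import numpy as np
import pandas as pd


def mean_abs_correlation_diff(train_corr: pd.DataFrame, syn_corr: pd.DataFrame) -> float:
    """Mean absolute cell-wise difference between two correlation matrices.

    Only rows and columns present in both matrices are compared, and
    cells that are NaN in either matrix are ignored.
    """
    rows = train_corr.index.intersection(syn_corr.index)
    cols = train_corr.columns.intersection(syn_corr.columns)
    diff = (train_corr.loc[rows, cols] - syn_corr.loc[rows, cols]).abs()
    return float(np.nanmean(diff.to_numpy(dtype=float)))


# Toy matrices (made-up values); asymmetric on purpose, like the matrices above
labels = ["A", "B"]
toy_train = pd.DataFrame([[1.0, 0.5], [0.4, 1.0]], index=labels, columns=labels)
toy_syn = pd.DataFrame([[1.0, 0.3], [0.1, 1.0]], index=labels, columns=labels)

score = mean_abs_correlation_diff(toy_train, toy_syn)
print(f"Mean absolute correlation difference: {score:.3f}")
```

In this notebook the inputs would be `stats.calculate_correlation(train_df)` and `stats.calculate_correlation(syn_df)`. Values closer to 0 suggest the synthetic data preserves pairwise relationships more closely. With only 100 sampled rows the synthetic matrix is likely to be noisy, so sampling more rows before comparing may give a fairer picture.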
