This library is an implementation of "The Synthetic Data Vault", described in the following paper from MIT:
I developed this library during the fourth-year internship of my computer science engineering degree, at Ulster University, Derry, Northern Ireland.
🎙️ DISCLAIMER: This implementation may not exactly conform to the original, so its behavior can differ.
- Python 3.6 or higher
```sh
git clone https://github.com/Taken0711/SDV-python.git
./install.sh
```
It installs the following dependencies:
- NumPy
- SciPy
- Pandas
The project is built as a library, so you just need to import it in your Python code.
```python
import sdv
```
Then use it by calling the `syn` function.
```python
res = sdv.syn(metadata, data)
```
The signature of `syn` is described below.
```python
sdv.syn(metadata, data, size=1, header=None)
```
This function returns a table with the generated synthetic data, following the original SDV methodology.
Parameters:
- `metadata` (array_like): A 1-D array containing the data type of each column. The available types are `sdv.INT`, `sdv.FLOAT`, `sdv.CATEGORICAL` and `sdv.DATE`.
- `data` (array_like): A 2-D array containing the data to synthesize. The values must match the types described in `metadata`.
- `size` (integer, optional): The number of rows to generate.
- `header` (array_like, optional): A 1-D array containing the column names. It is used in logs, to help track which distribution is used for which column.

Returns:
- `res` (array_like): The generated data set. Its number of rows is defined by the `size` parameter.
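For instance, a minimal in-memory call could look as follows (a sketch, assuming the library accepts plain lists of rows, as in the complete example below):

```python
import sdv

# One CATEGORICAL column and one FLOAT column (assumed input format:
# plain lists of rows, as in the complete example below).
metadata = [sdv.CATEGORICAL, sdv.FLOAT]
data = [
    ["a", 1.0],
    ["a", 1.2],
    ["b", 3.5],
]

# Generate 10 synthetic rows; `header` is only used in logs.
res = sdv.syn(metadata, data, size=10, header=["label", "value"])
```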
Complete example:
```python
import csv
import sdv

with open("iris.data", "r") as in_file, open("synthetic_iris.data", "w") as out_file:
    iris_reader = csv.reader(in_file)
    header = next(iris_reader)
    ods = [e for e in iris_reader]

    metadata = [sdv.FLOAT, sdv.FLOAT, sdv.FLOAT, sdv.FLOAT, sdv.CATEGORICAL]
    sds = sdv.syn(metadata, ods, size=len(ods), header=header)

    iris_writer = csv.writer(out_file)
    iris_writer.writerow(header)
    iris_writer.writerows(sds)
```
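Since Pandas is installed as a dependency, the synthetic rows can conveniently be inspected as a DataFrame; a minimal sketch, assuming `sds` and `header` from the example above:

```python
import pandas as pd

# Quick sanity check of the synthetic rows as a DataFrame
# (assumes `sds` and `header` from the example above).
df = pd.DataFrame(sds, columns=header)
print(df.describe(include="all"))
```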
This implementation also includes an analysis by class. It offers better results than the basic `syn` function by analyzing each class independently.
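Conceptually, the per-class approach amounts to splitting the data on the class column, synthesizing each subset separately, and concatenating the results. The sketch below illustrates this idea using the public `syn` function; it is not the library's actual implementation:

```python
import sdv

def syn_by_class_sketch(metadata, data, class_column, size=1, header=None):
    """Illustrative only, not the library's actual implementation:
    split the rows by class, synthesize each subset with `syn`,
    and concatenate the results."""
    result = []
    for cls in {row[class_column] for row in data}:
        subset = [row for row in data if row[class_column] == cls]
        # Give each class a share of `size` proportional to its frequency.
        n = max(1, round(size * len(subset) / len(data)))
        result.extend(sdv.syn(metadata, subset, size=n, header=header))
    return result
```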
It can be called as follows:
```python
res = sdv.syn_by_class(metadata, data, class_column)
```
The signature of `syn_by_class` is described below.
```python
sdv.syn_by_class(metadata, data, class_column, size=1, header=None)
```
This function returns a table with the generated synthetic data, produced by analyzing the data class by class.
Parameters:
- `metadata` (array_like): see `syn`.
- `data` (array_like): see `syn`.
- `class_column` (integer): The index of the column containing the class values.
- `size` (integer, optional): see `syn`.
- `header` (array_like, optional): see `syn`.

Returns:
- `res` (array_like): see `syn`.
Complete example:
```python
import csv
import sdv

with open("iris.data", "r") as in_file, open("synthetic_iris.data", "w") as out_file:
    iris_reader = csv.reader(in_file)
    header = next(iris_reader)
    ods = [e for e in iris_reader]

    metadata = [sdv.FLOAT, sdv.FLOAT, sdv.FLOAT, sdv.FLOAT, sdv.CATEGORICAL]
    sds = sdv.syn_by_class(metadata, ods, 4, size=len(ods), header=header)

    iris_writer = csv.writer(out_file)
    iris_writer.writerow(header)
    iris_writer.writerows(sds)
```
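To sanity-check the output, you can compare the class distribution of the original and synthetic data; a minimal sketch, assuming `ods` and `sds` from the example above (the class is in column 4):

```python
from collections import Counter

# Compare how often each class appears in the original (`ods`)
# and synthetic (`sds`) rows from the example above.
print("original: ", Counter(row[4] for row in ods))
print("synthetic:", Counter(row[4] for row in sds))
```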
The `example` folder shows how to use the library with multiple data sets and both methods.
As mentioned in the warning, the version on master does not work for certain data sets (the exact cause is unknown; it seems to be a type issue with NumPy). Two versions are therefore available:
- 🐇 Performance version: works only on certain data sets, but is around 20 times faster than the slow version
- 🐢 Slow version: works on every data set (as far as I know), but slowly