<a href="https://colab.research.google.com/github/pegleggen/SynthData_Sungularity/blob/main/SynthData_Singularity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Welcome to Singularity's Synth Data**



This notebook demonstrates how to generate synthetic datasets for the AR ballet "Singularity" using the Synthetic Data Vault (SDV) library, which originated at MIT. Movement and brainwave data were captured from two ballet dancers over the 6-8 months in which the choreography was created. Due to stringent privacy and ethical considerations, the dancers' actual data is not publicly available. The raw data examples provided here are representative samples recorded from the authors, intended solely for demonstrating the synthetic data generation process.



**Brainwave** data was captured using an Emotiv&trade; Insight 5-channel EEG headset, with channels AF3, F7, F3, FC5, and AF4, sampled at 128 Hz. The raw data was exported to CSV format and is available in a viewable Google Drive folder, which is linked in the code. Each row in the dataset represents a single timepoint, capturing the simultaneous movement and brainwave readings at that instant.

The **Movement** data was acquired using a motion capture pipeline (RGB video to PostPose-Estimate system), recording at 60 frames per second.

*Note:* none of the dancers' data is publicly available, for privacy and ethical reasons. The raw data available here was recorded from the authors during ballet classes and various other activities.

**To cite this Colab Notebook** | CC BY-NC-SA 4.0: APA7

Smith-Nunes, G. (2025). SynthData Singularity [Python; Colab]. https://colab.research.google.com/drive/1_GPjenw4voPWL7STov81c3XeXISgTPnb#scrollTo=R63QJZHJAIKe






## The Data
>*The data is structured as a **single table** containing rows and columns of information. In general, each row represents a new entity such as a user, transaction, or session. For this project, each row is a single timestamp in a time series.*






# **Introduction**

## **What is synthetic data?**

Rubin ([1993](/https://www.google.com/url?q=https%3A%2F%2Fdoi.org%2F10.1016%2Fj.caeai.2023.100131)) first describes the novel development of synthetic microdata and defines it as ‘data sets consisting of records of individual synthetic units rather than actual units’. In his discussion article ‘Statistical Disclosure Limitation’, cited over 800 times,  he proposed the use of synthetic datasets as a way to mitigate violations of confidentiality and data privacy guidelines. The EU defines synthetic data as “[...] artificial data that is generated from original data and a model that is trained to reproduce the characteristics and structure of the original data” (EDPS, [2024b](https://www.edps.europa.eu/data-protection/our-work/ipen/ipen-webinar-2021-synthetic-data-what-use-cases-privacy-enhancing)). (Smith-Nunes & Ness, preprint)





## **Overview: Generating Synthetic Data**

This section outlines the process of generating synthetic datasets for the AR ballet "Singularity," designed to serve as a teaching resource. We'll walk through the key steps using the Synthetic Data Vault (SDV) library, focusing on clarity and understanding.

**A. Data Preparation:**

* **Loading the Data:**
    * First, we load our sample movement and brainwave data from CSV files with the `pandas` library, reading either local files or files stored in Google Drive. This step simulates how real-world data would be imported into a data science project.
    * Example:
        ```python
        import pandas as pd
        movement_data = pd.read_csv("movement_data.csv")
        brainwave_data = pd.read_csv("brainwave_data.csv")
        ```
* **Understanding the Data:**
    * We then inspect the data's structure (shape, columns, data types) to understand its characteristics. This helps us determine the appropriate synthetic data generation techniques.
    * Example:
        ```python
        print("Movement Data Shape:", movement_data.shape)
        print(movement_data.head())
        ```

**B. Choosing a Synthetic Data Generation Model:**

* **Gaussian Copula:**
    * For this example, we'll use the Gaussian Copula synthesizer from the SDV library. It models the distribution of each column separately and then captures the correlations between columns, which suits numerical datasets such as ours.
* **Rationale:**
    * We chose this model for its relative ease of use and its good performance with varied numerical data.
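
To make the idea behind a Gaussian copula concrete, here is a minimal pure-Python sketch for two columns. It is an illustration under stated assumptions, not SDV's implementation: it uses each column's observed values (the empirical distribution) where SDV fits a parametric distribution, and the helper names (`pearson`, `normal_scores`, `quantile`, `copula_sample`) and the toy data are hypothetical.

```python
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def normal_scores(values):
    """Replace each value by the standard-normal score of its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    scores = [0.0] * len(values)
    for rank, i in enumerate(order):
        # (rank + 0.5) / n stays strictly between 0 and 1
        scores[i] = STANDARD_NORMAL.inv_cdf((rank + 0.5) / len(values))
    return scores

def quantile(sorted_values, p):
    """Observed value at cumulative probability p (empirical inverse CDF)."""
    return sorted_values[min(int(p * len(sorted_values)), len(sorted_values) - 1)]

def copula_sample(col_a, col_b, num_rows, seed=0):
    """Draw (a, b) pairs that keep each column's values and their correlation."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        z_a = rng.gauss(0, 1)
        # z_b is standard normal with correlation rho to z_a
        z_b = rho * z_a + max(0.0, 1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        rows.append((quantile(sorted_a, STANDARD_NORMAL.cdf(z_a)),
                     quantile(sorted_b, STANDARD_NORMAL.cdf(z_b))))
    return rows

# Toy stand-in for two correlated channels (not real EEG data)
rng = random.Random(1)
real_a = [rng.gauss(4200, 50) for _ in range(200)]
real_b = [0.5 * a + rng.gauss(0, 10) for a in real_a]
synthetic = copula_sample(real_a, real_b, num_rows=200)

# Compare the correlation in the toy "real" data with the sampled data
print(round(pearson(real_a, real_b), 2), round(pearson(*zip(*synthetic)), 2))
```

The two printed correlations should be close to each other, which is the sense in which the sampled rows resemble the original ones: each column keeps its own range of values, and the relationship between the columns is preserved.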

**C. Training the Model:**

* **Fitting the Model:**
    * We initialize the Gaussian Copula model and "train" it on our real data using the `fit()` method. This allows the model to learn the underlying statistical patterns.
    * Example:
        ```python
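        from sdv.metadata import Metadata
        from sdv.single_table import GaussianCopulaSynthesizer
        metadata = Metadata.detect_from_dataframe(data=movement_data)  # describe the columns
        model = GaussianCopulaSynthesizer(metadata)
        model.fit(movement_data)  # train the model on the real data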
        ```
* **Parameter Explanation (If applicable):**
    * If you change any parameters from their defaults, explain what each parameter does. When teaching, you could add comments in the code to explain further.

**D. Generating Synthetic Data:**

* **Sampling:**
    * Once the model is trained, we use the `sample()` method to generate new, synthetic data points that resemble the original data. The `num_rows` parameter determines the size of the synthetic dataset.
    * Example:
        ```python
        synthetic_movement_data = model.sample(num_rows=len(movement_data)) #creating synthetic data
        ```
* **Visualising the Results:**
    * We use `matplotlib` to plot the real data and the synthetic data together, to see how well the model performed.
    * Example:
        ```python
        import matplotlib.pyplot as plt
        plt.figure()
        plt.plot(movement_data['column_name'], label = 'Real Data')
        plt.plot(synthetic_movement_data['column_name'], label = 'Synthetic Data')
        plt.legend()
        plt.show()
        ```

**E. Key Concepts:**

* **Statistical Similarity:**
    * The goal is to create synthetic data that is statistically similar to the original data, meaning it shares the same overall patterns and distributions.
* **Privacy Preservation:**
    * Synthetic data is intended to be usable without leaking private information.

PURPOSE: Synthetic data allows for data sharing and exploration, without the ethical concerns of sharing real private data.



---

## **An Example of the completed program**

This illustrative example shows the complete code in one place. It demonstrates that with just a CSV file and a few lines of code you can create your own synthetic datasets. The SDV imports appear inline where they are needed; the code assumes that `pandas` (as `pd`) and `Metadata` (from `sdv.metadata`) have already been imported, as shown in steps 2 and 2.3. It is intended to support an understanding of both the ethics of biometric data and coding practices.

In [None]:
# Load data
brainwave_data = pd.read_csv('/content/drive/My Drive/brainwaves/B1.csv')
print(brainwave_data.head())

# Detect metadata
metadata = Metadata.detect_from_dataframe(
    data=brainwave_data,
    table_name='brainwave_data')

# Create synthesizer
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)

# Fit synthesizer
synthesizer.fit(data=brainwave_data)

# Generate synthetic data
synthetic_brainwave_data = synthesizer.sample(num_rows=500)
print(synthetic_brainwave_data.head())

# Run diagnostic report
from sdv.evaluation.single_table import run_diagnostic
diagnostic_report = run_diagnostic(
    real_data=brainwave_data,
    synthetic_data=synthetic_brainwave_data,
    metadata=metadata)

# Run quality report
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    brainwave_data,
    synthetic_brainwave_data,
    metadata)

# Visualize
from sdv.evaluation.single_table import get_column_plot
eeg_column_plot = get_column_plot(
    real_data=brainwave_data,
    synthetic_data=synthetic_brainwave_data,
    column_name='AF3', # Replace with an actual column name from your data
    metadata=metadata)
eeg_column_plot.show()

# Save the synthesizer, then load it back
synthesizer.save('my_synthesizer.pkl')
synthesizer = GaussianCopulaSynthesizer.load('my_synthesizer.pkl')

**Steps 1-7 below break down each part of the synthetic data generation process using the SDV library and Python.**

# **1 Install Library**

In [None]:
# Install the library (or use your favourite package installer)
%pip install sdv

# **2 Load the data**

The dataset is a CSV file of EEG data, recorded during ballet classes or while editing articles. The metadata of each file contains the appropriate details.

***UPLOAD DATA to notebook***

In [None]:

from google.colab import drive
drive.mount('/content/drive')

# Explanation:
# The brainwave data for this notebook is shared in a Google Drive folder.
# To access the data, you need to add this shared folder to your own Google Drive.
# 1. Open the shared folder link: https://drive.google.com/drive/folders/1JypGwP-X2V-rWzqLe4HK6_3sWMHTsg_J?usp=sharing
# 2. While viewing the folder, click the "Add shortcut to Drive" icon (it looks like a folder with a plus sign).
# 3. Choose a location in your Drive to add the shortcut.

# Once the shortcut is added, you can access the files using the path to the shared folder
# within your mounted Google Drive.

# Example to list files in the shared folder:
# Replace 'My Drive/brainwaves' with the actual path to where you added the shortcut in your Drive.
!ls "/content/drive/My Drive/brainwaves"

# Example to read a CSV file from the shared folder
# Replace '/content/drive/My Drive/brainwaves/B1.csv' with the actual path to the file in your Drive
# (including the folder name you chose when adding the shortcut).
import pandas as pd
brainwave_data = pd.read_csv('/content/drive/My Drive/brainwaves/B1.csv')
print(brainwave_data.head())

# You can also try loading the other file, H1.csv:
# brainwave_data = pd.read_csv('/content/drive/My Drive/brainwaves/H1.csv')
# print(brainwave_data.head())

---

## **2.1 Loading your own (local) datasets**

A local dataset is a dataset that you have already downloaded onto your computer. These do not require any internet connectivity to access.
Use the `load_csvs` function to load any datasets that are stored as CSVs.

**Parameters:**

(required) `folder_name`: A string with the name of the folder where the datasets are stored

`read_csv_parameters`: A dictionary with additional parameters to use when reading the CSVs. The keys are any of the parameter names of the `pandas.read_csv` function and the values are your inputs.

**Returns:** A dictionary that contains all the CSV data found in the folder. The key is the name of the file (without the .csv suffix) and the value is a pandas DataFrame containing the data.






In [None]:
from sdv.datasets.local import load_csvs

# assume that my_folder contains a CSV file named 'NAME.csv'

datasets = load_csvs(
    folder_name='my_folder/', #add correct folder name
    read_csv_parameters={
        'skipinitialspace': True,
        'encoding': 'utf_32'  # change to match your files, e.g. 'utf-8'
    })

# the data is available under the file name
data = datasets['NAME'] #add correct name

---

## **2.2 Primary Key**
*   `XXX` is a primary key that uniquely identifies every row.
*   Other columns have a variety of data types, and some of the data may be missing.

The data is available as a single table.





In [None]:
# Get the data: `real_data` is the DataFrame loaded in step 2
real_data = brainwave_data
real_data.head()

---

## **2.3 MetaData**
>*metadata, a description of the dataset. It includes the primary keys as well as the data types for each column (called "sdtypes").* [see docs](https://docs.sdv.dev/sdv/single-table-data/data-preparation/creating-metadata)

Metadata helps the synthesizer understand the data structure and types.

**Parameters:**
(required) `data`: Your pandas DataFrame object that contains the data

`table_name`: A string describing the name of your table. SDV will use the table name when referring to your table in the metadata, as well as in any warnings or descriptive error messages. By default, your data table is named 'table'.

**Output:** A Metadata object that describes the data




In [None]:

from sdv.metadata import Metadata
metadata = Metadata.detect_from_dataframe(
    data=real_data, # the DataFrame that holds your real data
    table_name='NAME') # update name

To inspect the detected metadata, call `visualize()`.

In [None]:
metadata.visualize()

**You may need to update the metadata.**

The detected metadata is not guaranteed to be accurate or complete. Be sure to carefully inspect the metadata and update it so it accurately represents your data.

Example:

```python
metadata.update_column(
    column_name='start_date',
    sdtype='datetime',
    datetime_format='%Y-%m-%d')
metadata.update_column(
    column_name='user_cell',
    sdtype='phone_number',
    pii=True)
metadata.validate()
```




# **3 Creating the Synthetic Dataset**

The SDV creates synthetic data using machine learning.
You'll start by creating a synthesizer based on your metadata. As noted above, the metadata helps the synthesizer understand the data structure and types.

Next, you'll train the synthesizer using real data. In this phase, the
synthesizer will learn patterns from the real data.

Once your synthesizer is trained, you can use it to generate new, synthetic data.

>*An **SDV synthesizer** is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. [See docs](https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers)*

Publicly Available Synthesizers


*   Gaussian Copula	(most functionality)
*   CTGAN
*   TVAE






---

## **3.1 Include Metadata**

When creating your synthesizer, you are required to pass in a **Metadata object** as the first argument. All other parameters are optional. You can include them to customize the synthesizer.

### Understanding the Synthesizer Parameters

When initialising the `GaussianCopulaSynthesizer`, we can include additional parameters to customise its behaviour. Even if you keep most of the default settings, it's helpful to understand what these parameters do:

*   **`metadata` (required):** This is the most crucial parameter. It's the `Metadata` object we created earlier, which provides the synthesizer with information about your data's structure and column types.
*   **`enforce_min_max_values` (default: `True`):** When set to `True`, the synthesizer will ensure that the synthetic data generated for each numerical column stays within the minimum and maximum range observed in the real data. This can be useful to prevent generating unrealistic values.
*   **`enforce_rounding` (default: `True`):** When set to `True`, the synthesizer rounds the generated numerical data to the same number of decimal places as the real data. Set it to `False` to keep the full precision of the sampled values.
*   **`numerical_distributions` (default: `None`):** This parameter allows you to specify the distribution used to model individual numerical columns, as a dictionary that maps a column name to a distribution name. Any column you do not list is modelled with `default_distribution`. You could, for example, explicitly set a column to be modelled with a `'norm'` (normal) distribution if you know it follows that pattern.
*   **`default_distribution` (default: `'beta'`):** This sets the distribution used for numerical columns that are not listed in `numerical_distributions`. The options are `'norm'`, `'beta'`, `'truncnorm'`, `'uniform'`, `'gamma'`, and `'gaussian_kde'`. The default is often a good starting point, but you might experiment with this if your data doesn't seem to fit the default well.

By understanding these parameters, you can gain more control over the synthetic data generation process and potentially improve the quality of your synthetic dataset.
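
As a rough illustration of what `enforce_min_max_values` and `enforce_rounding` do to sampled values, the sketch below clips raw samples to the observed range and rounds them to the observed number of decimal places. It mirrors the described effect only and is not SDV's internal code; `decimal_places`, `post_process`, and the sample values are hypothetical.

```python
def decimal_places(values):
    """Largest number of decimal places seen in the values."""
    places = 0
    for value in values:
        text = repr(float(value))
        if 'e' in text or 'inf' in text or 'nan' in text:
            continue  # skip values whose repr has no plain decimal part
        fraction = text.split('.')[1]
        places = max(places, 0 if fraction == '0' else len(fraction))
    return places

def post_process(raw_samples, real_values,
                 enforce_min_max_values=True, enforce_rounding=True):
    """Clip samples to the real range and round them to the real precision."""
    low, high = min(real_values), max(real_values)
    digits = decimal_places(real_values)
    processed = []
    for value in raw_samples:
        if enforce_min_max_values:
            value = min(max(value, low), high)
        if enforce_rounding:
            value = round(value, digits)
        processed.append(value)
    return processed

# Hypothetical readings for one channel, plus three raw samples from a model
real_af3 = [4210.26, 4198.97, 4225.13, 4205.64]
raw_samples = [4190.123456, 4212.98765, 4231.5]

print(post_process(raw_samples, real_af3))
# → [4198.97, 4212.99, 4225.13]
```

The first and last samples fall outside the observed range and are pulled back to its edges, while the middle one is only rounded to two decimal places.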


See example

In [None]:
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(
    metadata, # required
    enforce_min_max_values=True,
    enforce_rounding=False,
    numerical_distributions={
        # 'column_name': 'distribution_name', e.g. 'AF3': 'norm'
        # (check your file for the correct column names)
    },
    default_distribution='norm'
)

Next, we can train the synthesizer. We pass in the real data so it can learn patterns using machine learning.

In [None]:
synthesizer.fit(
    data=real_data # the DataFrame that holds your real data
)

*We can now use the synthesizer*

---

## **3.2 Advanced - Modifying Synthesizer**

>*You can control the **pre- and post-processing** steps in your synthesizer, and set up custom, anonymization controls. Pre and post-process your data by using reversible data transformations for each column. After assigning transformers, you can also modify them to customize the pre- and post-processing.*

>*You can also enforce **logical** rules in the form of constraints. The SDV library comes with some predefined constraint logic that is ready to use. You can define each constraint using a dictionary format.*
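
To show what a reversible transformation means in practice, here is a toy encoder that turns a categorical column into numbers before modelling and back into categories afterwards. It is a sketch of the concept, not one of SDV's transformers; the class name `CategoryEncoderSketch` and the `activities` values are hypothetical.

```python
class CategoryEncoderSketch:
    """Toy reversible transformer: categories <-> numbers in [0, 1)."""

    def fit(self, values):
        # Remember the categories in a fixed, sorted order
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        # Each category becomes the midpoint of its own equal-width interval
        width = 1 / len(self.categories)
        return [(self.categories.index(v) + 0.5) * width for v in values]

    def reverse_transform(self, numbers):
        # Any number inside a category's interval maps back to that category
        last = len(self.categories) - 1
        return [self.categories[max(0, min(int(x * len(self.categories)), last))]
                for x in numbers]

# Hypothetical 'activity' column describing when a recording was made
activities = ['ballet_class', 'editing', 'ballet_class']
encoder = CategoryEncoderSketch().fit(activities)

encoded = encoder.transform(activities)       # pre-processing
print(encoded)
print(encoder.reverse_transform(encoded))     # post-processing restores the original
print(encoder.reverse_transform([0.1, 0.9]))  # numbers from a model map to categories
```

The synthesizer only ever sees numbers, and the reverse step guarantees that whatever it samples is turned back into a valid category.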


---

## **3.3 Post Processing Advanced Step: Exploring Diffusion Generative Models**

While the Gaussian Copula is a powerful and versatile model for generating synthetic data, especially for tabular data, the field of synthetic data generation is constantly evolving. More advanced techniques are being developed, and one area of active research and application is **diffusion generative models**.

Diffusion models are a type of deep generative model that have shown remarkable success in generating complex data, particularly in areas like image and audio synthesis. The core idea behind diffusion models is a two-step process:

1.  **Forward Diffusion:** Gradually adding random noise to the real data until it becomes pure noise.
2.  **Reverse Diffusion:** Learning to reverse this noise process, starting from random noise and gradually removing it to generate new data samples that resemble the real data.
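
The forward step can be sketched in a few lines of pure Python. This is a minimal illustration that assumes a linear noise schedule and a standardised signal, using the common closed form in which the noised signal at step t is `sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise`. The function name, schedule values, and toy sine wave are assumptions, and the reverse step (which requires a trained neural network) is not shown.

```python
import math
import random

def forward_diffusion(signal, num_steps=50, beta_start=1e-4, beta_end=0.2, seed=0):
    """Return the noised signal at every step, plus the final alpha_bar."""
    rng = random.Random(seed)
    # Linear schedule: the amount of noise added grows from step to step
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bar = 1.0
    history = []
    for beta in betas:
        alpha_bar *= 1.0 - beta  # share of the original signal that survives
        history.append([
            math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0, 1)
            for x in signal
        ])
    return history, alpha_bar

# Toy standardised signal: a sine wave standing in for one channel
signal = [math.sin(2 * math.pi * i / 32) for i in range(128)]
history, final_alpha_bar = forward_diffusion(signal)

print(len(history), len(history[0]))
print(final_alpha_bar < 0.01)  # almost none of the original signal remains
```

Each entry of `history` is drawn directly from the original signal at that noise level rather than by chaining the steps, which is enough to show how the signal fades into noise.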

**Relevance to Brainwave and Movement Data:**

Diffusion models hold promise for generating highly realistic and nuanced synthetic time-series data, such as brainwave and movement recordings. Their ability to capture complex temporal dependencies and subtle variations in data could lead to synthetic datasets that are even more statistically similar to the original recordings, potentially improving the performance of downstream analyses or machine learning models trained on this data.

**Implementation Considerations (Advanced):**

Implementing diffusion models for tabular or time-series data like ours is more complex than using the Gaussian Copula. It typically involves:

*   Building or using pre-trained deep learning models.
*   Significant computational resources for training.
*   More advanced understanding of deep learning concepts.

While the SDV library primarily focuses on models like Gaussian Copula, CTGAN, and TVAE for tabular data, the underlying principles of diffusion models are relevant to the future of synthetic data generation for complex datasets.

**Note:** Implementing a diffusion generative model for this specific dataset is an advanced topic and is beyond the scope of this introductory notebook. However, it's important to be aware of these <mark>emerging techniques</mark> in the field of synthetic data.

# 4 Generating the synth dataset

Create realistic synthetic data that follows the same format and mathematical properties as the real data.

**Parameters**

(required) `num_rows`: An integer >0 that specifies the number of rows to synthesize

`batch_size`: An integer >0, describing the number of rows to sample at a time. If you are sampling a large number of rows, setting a smaller batch size allows you to see and save incremental progress. Defaults to the same as num_rows.

`max_tries_per_batch`: An integer >0, describing the number of sampling attempts to make per batch. If you have included constraints, it may take multiple batches to create valid data. Defaults to 100.

`output_file_path`: A string describing a CSV filepath for writing the synthetic data. Set it to `None` to skip writing to a file. Defaults to `None`.

**Returns:** A pandas DataFrame object with synthetic data. The synthetic data mimics the real data.
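
To see why `batch_size` and `max_tries_per_batch` matter when rules are in place, the sketch below collects rows batch by batch and retries until each batch has enough valid rows. It illustrates the general reject-and-retry idea only and is not SDV's internal code; `sample_in_batches`, `draw_row`, `is_valid`, and the AF3 rule are hypothetical.

```python
import random

def sample_in_batches(draw_row, is_valid, num_rows,
                      batch_size=None, max_tries_per_batch=100):
    """Collect num_rows valid rows, retrying each batch a limited number of times."""
    batch_size = batch_size or num_rows
    rows = []
    while len(rows) < num_rows:
        target = min(batch_size, num_rows - len(rows))
        batch, tries = [], 0
        while len(batch) < target and tries < max_tries_per_batch:
            # Draw only as many candidates as are still missing from this batch
            candidates = [draw_row() for _ in range(target - len(batch))]
            batch.extend(row for row in candidates if is_valid(row))
            tries += 1
        if len(batch) < target:
            raise RuntimeError('Not enough valid rows: relax the rule or '
                               'raise max_tries_per_batch.')
        rows.extend(batch)  # a finished batch could be saved to disk here
    return rows

rng = random.Random(0)

def draw_row():
    # Stand-in for a trained synthesizer producing one row
    return {'AF3': rng.gauss(4200, 50)}

def is_valid(row):
    # Hypothetical rule that every row must satisfy
    return row['AF3'] >= 4150

rows = sample_in_batches(draw_row, is_valid, num_rows=500, batch_size=100)
print(len(rows), min(row['AF3'] for row in rows) >= 4150)
```

A stricter rule rejects more candidates, so each batch needs more attempts; if the rule is almost never satisfied, the try limit stops the loop instead of letting it run indefinitely.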

### Use the function below and pass in number of rows

In [None]:
synthetic_data = synthesizer.sample(
    num_rows=500 #pass number of rows
)
synthetic_data.head()

---

### 4.1 Conditional Sampling

Do you have exact values or specific conditions that you'd like to influence the synthetic data generation? Using conditional sampling allows you to provide this information and generate synthetic data that meets those criteria.

Conditional sampling can be useful to:

*   **Generate hypothetical scenarios:** By fixing values or ranges to correspond to specific conditions (e.g., higher brainwave activity).
*   **De-bias your data:** By requesting an equal balance of data points that meet certain criteria.
*   **Impute unknown data:** By providing known data points and allowing the model to generate the rest based on those conditions.

Let's illustrate with an example using our brainwave data. In the SDV versions we are aware of, a `Condition` fixes a column to an exact value rather than a range. The example therefore has two parts: first we request rows where the 'AF3' channel is fixed at one value, then we build a subset where 'AF3' is above a threshold (greater than 4000) by sampling a larger batch and filtering it.

In [None]:
from sdv.sampling import Condition

# Part 1: exact value. Ask for 100 rows where AF3 is fixed at 4000
# (choose a value that is plausible for your own data).
af3_fixed_condition = Condition(num_rows=100, column_values={'AF3': 4000})

synthetic_brainwave_data_af3_fixed = synthesizer.sample_from_conditions(
    conditions=[af3_fixed_condition])

# Part 2: range. Sample a larger batch, then keep rows with AF3 above 4000.
candidates = synthesizer.sample(num_rows=5000)
synthetic_brainwave_data_af3_high = candidates[candidates['AF3'] > 4000].head(100)

print("\nSynthetic data with high AF3 activity:")
print(synthetic_brainwave_data_af3_high.head())

**Explanation:**

*   We import the `Condition` class from `sdv.sampling`.
*   We create a `Condition` object named `af3_fixed_condition`.
    *   `num_rows=100` specifies that we want 100 synthetic rows that meet this condition.
    *   `column_values={'AF3': 4000}` fixes the 'AF3' column to the exact value 4000 in every generated row.
*   We use the `synthesizer.sample_from_conditions()` method, passing a list containing our `Condition` object. This method generates synthetic data that satisfies the specified conditions.
*   For the threshold case, we sample 5000 unconditioned rows and keep up to 100 of those whose 'AF3' value is greater than 4000. If fewer than 100 rows pass the filter, increase `num_rows`.
*   The filtered data is stored in the `synthetic_brainwave_data_af3_high` variable, and we print its head to see the results.

# **5 Dataset Evaluation**

Diagnostic tests comparing real data v synth data

*   real v synth
*   data quality
*   Anonymisation
*   Visualisation real v synth





---

## 5.1 Real v Synth

**Parameters:**

(required) `real_data`: A pandas.DataFrame containing the real data

(required) `synthetic_data`: A pandas.DataFrame containing the synthetic data

(required) `metadata`: A Metadata object with your metadata



`verbose`: A boolean describing whether or not to print the report progress and results. Defaults to `True`. Set this to `False` to run the report silently.

Returns: An SDMetrics DiagnosticReport object generated with your real and synthetic data.

The score should be 100%. The diagnostic report checks for basic data validity and data structure issues. You should expect the score to be perfect for any of the default SDV synthesizers.

**Run the following for a diagnostic test on your synthetic data.**

In [None]:
from sdv.evaluation.single_table import run_diagnostic

diagnostic_report = run_diagnostic(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata)

---

## 5.2 Data Quality

We can also measure the data quality or the statistical similarity between the real and synthetic data. This value may vary anywhere from 0 to 100%.


Use this function to run an evaluation on the synthetic data.

In [None]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata
)

The report gives a score from 0 to 100%: the synthetic data is about X% similar to the real data in terms of statistical similarity.

A 100% score means that the patterns are exactly the same. For example, if you compared the real data with itself (identity), the score would be 100%.

A 0% score means the patterns are as different as can be. This would entail that the synthetic data purposefully contains anti-patterns that are opposite from the real data.

Any score in the middle can be interpreted along this scale. For example, a score of 80% means that the synthetic data is about 80% similar to the real data — about 80% of the trends are similar.

We can also get more details from the report. For example, the Column Shapes sub-score is XX%. Which columns had the highest vs. the lowest scores?
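
To give a feel for what a per-column shape score measures, the sketch below computes a Kolmogorov-Smirnov complement in pure Python: 1 minus the largest gap between the two columns' empirical cumulative distributions. As we understand it, the report's Column Shapes score for numerical columns is based on this kind of comparison, but the function here is an illustrative assumption and not the library's code.

```python
def ks_complement(real_values, synthetic_values):
    """1 minus the largest gap between the two empirical CDFs (1.0 = identical)."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    i = j = 0
    max_gap = 0.0
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        # Advance each pointer past every value <= x
        while i < len(real_sorted) and real_sorted[i] <= x:
            i += 1
        while j < len(synth_sorted) and synth_sorted[j] <= x:
            j += 1
        gap = abs(i / len(real_sorted) - j / len(synth_sorted))
        max_gap = max(max_gap, gap)
    return 1.0 - max_gap

print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))   # → 1.0 (identical columns)
print(ks_complement([1, 2, 3, 4], [1, 2, 3, 10]))  # → 0.75 (one value differs)
print(ks_complement([1, 2, 3], [10, 11, 12]))      # → 0.0 (no overlap at all)
```

For the per-column scores of your own report, `quality_report.get_details(property_name='Column Shapes')` should return a table with one row per column.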

Table **TO COMPLETE**

Score  | Commentary
-------------------|------------------
Row 1, Col 1       | Row 1, Col 2
Row 2, Col 1       | Row 2, Col 2

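Save the quality report so that you can reload it later.

In [None]:
import os

os.makedirs('results', exist_ok=True)  # the target folder must exist before saving
quality_report.save(filepath='results/quality_report.pkl')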

---

## 5.3 Anonymisation

Anonymisation is the process of removing or obscuring personally identifiable information (PII) from a dataset. Synthetic data can be used to create a version of your data that preserves statistical patterns but removes direct links to individuals.

While our brainwave data doesn't contain obvious PII like names or addresses, if you were working with a dataset that did, here's how you might identify and examine those columns.

Let's assume you have a CSV file that includes columns like 'Participant_Name', 'Age', and 'Location' which could be used to identify individuals.





In [None]:
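# Hypothetical column names: edit to match your own data
sensitive_column_names = ['Participant_Name', 'Age', 'Location']

# print() is used because a notebook cell only displays its last expression
print(real_data[sensitive_column_names].head(3))

print(synthetic_data[sensitive_column_names].head(3))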

**Explanation:**

*   We define a list `sensitive_column_names` containing the names of columns that might contain personally identifiable information. **Remember to replace these hypothetical column names with the actual names from your dataset if it contains PII.**
*   We then display the head of both the real data and the synthetic data for these specified columns.
*   The goal of a good synthetic data generation process (especially when dealing with PII) is that the synthetic data in these columns either doesn't exist, is generalised, or is transformed in a way that it cannot be linked back to individuals in the real data, while still maintaining the overall statistical characteristics that are not related to individual identity.

**Important Note:**

The SDV library offers features for handling PII during synthetic data generation, such as using specific data transformers or setting columns as PII in the metadata. For datasets with sensitive information, it is crucial to configure the synthesizer appropriately to ensure adequate anonymization and privacy preservation. Refer to the SDV documentation on handling PII for more advanced techniques.
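
One very basic sanity check, sketched below in pure Python, is to count how many synthetic rows are exact copies of a real row. This is a hypothetical helper with made-up values, and it is only a first check: finding zero copies does not by itself show that the synthetic data is private.

```python
def count_copied_rows(real_rows, synthetic_rows, decimals=2):
    """Count synthetic rows that match a real row exactly after rounding."""
    def key(row):
        # Round floats so that tiny numerical differences do not hide a copy
        return tuple(round(v, decimals) if isinstance(v, float) else v for v in row)

    real_keys = {key(row) for row in real_rows}
    return sum(1 for row in synthetic_rows if key(row) in real_keys)

# Made-up rows of (AF3, F7) readings
real_rows = [(4210.26, 4198.97), (4205.64, 4201.03)]
synthetic_rows = [(4210.26, 4198.97), (4207.11, 4199.5)]

print(count_copied_rows(real_rows, synthetic_rows))  # → 1
```

With pandas DataFrames, the same rows can be produced with `df.itertuples(index=False)`.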



---

## 5.4 Comparison Case - Augmented Data

Beyond generating synthetic data that mimics the original dataset, another valuable technique is data augmentation. Data augmentation involves creating new data points by applying transformations or combinations to the existing real data. This can be useful for:

*   **Increasing dataset size:** Creating more data for training machine learning models, especially when the original dataset is small.
*   **Improving model robustness:** Generating variations of existing data to make models less sensitive to noise or minor inconsistencies.
*   **Exploring hypothetical scenarios:** Simulating different conditions or combinations of data that weren't captured in the original recordings.

In the context of our project with brainwave and movement data, augmented data could be created by:

*   Adding controlled levels of noise to the original sensor readings to simulate variations in recording quality.
*   Combining movement sequences or brainwave patterns from different recording sessions to create novel combinations.
*   Slightly shifting or scaling time series data to represent minor variations in timing or amplitude.
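
Although a full augmentation pipeline is outside the scope of this notebook, the noise, scaling, and shifting ideas above can be sketched in pure Python. The function names and the toy signal are hypothetical, and the parameter values are arbitrary choices for illustration.

```python
import random

def add_noise(signal, noise_std, seed=0):
    """Add Gaussian noise to every sample to mimic recording variation."""
    rng = random.Random(seed)
    return [x + rng.gauss(0, noise_std) for x in signal]

def scale_amplitude(signal, factor):
    """Stretch or shrink the signal around its mean."""
    mean = sum(signal) / len(signal)
    return [mean + factor * (x - mean) for x in signal]

def shift_in_time(signal, steps):
    """Delay the signal by `steps` samples, padding the start with the first value."""
    steps = max(0, min(steps, len(signal)))
    if steps == 0:
        return list(signal)
    return [signal[0]] * steps + list(signal[:-steps])

# Toy signal standing in for one sensor channel
signal = [1.0, 2.0, 3.0, 4.0]

print(shift_in_time(signal, 1))     # → [1.0, 1.0, 2.0, 3.0]
print(scale_amplitude(signal, 2))   # → [-0.5, 1.5, 3.5, 5.5]
print(len(add_noise(signal, 0.1)))  # → 4
```

Unlike the synthesizer, each of these functions starts from a real recording, so augmented data stays directly tied to the original rows.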

Comparing synthetic data generated by the SDV synthesizer to this type of augmented data can provide insights into:

*   **The synthesizer's ability to capture variability:** How well does the synthetic data reflect the range of variations present in the augmented data?
*   **The impact of augmentation vs. synthesis:** How do the statistical properties of purely synthetic data compare to data created by augmenting real data?
*   **Potential for combining techniques:** Could a combination of synthetic data generation and data augmentation lead to even richer and more diverse datasets for analysis or model training?

While we won't be implementing data augmentation in this notebook, understanding this concept provides another perspective on creating varied datasets from limited original data.

# 6 Visualising the Data

Visualise the real vs. synthetic data.

Perform a 1D visualisation comparing a column of the real data to the synthetic data.



---

## 6.1 Colour Plot

Visualise real vs. synthetic data by comparing pairs of columns. This helps assess how well the synthesizer captures the relationships between different variables.




In [None]:
from sdv.evaluation.single_table import get_column_pair_plot

# Select two brainwave channels to compare (replace with your desired columns)
brainwave_columns_to_compare = ['AF3', 'F7']

fig = get_column_pair_plot(
    real_data=real_data, # use the variable name for your real brainwave data
    synthetic_data=synthetic_data, # use the variable name for your synthetic brainwave data
    column_names=brainwave_columns_to_compare,
    metadata=metadata) # use the variable name for your metadata

fig.show()

**Explanation:**

*   We import the `get_column_pair_plot` function from `sdv.evaluation.single_table`.
*   We define a list `brainwave_columns_to_compare` with the names of two brainwave channels you want to visualise the relationship between. You can change these to any two columns in your data.
*   We call `get_column_pair_plot`, passing in the real and synthetic brainwave data, the list of column names to compare, and the metadata.
*   The resulting plot (`fig`) shows scatter plots for each pair of columns, comparing the distributions and relationships in the real and synthetic data. Look for similar patterns and distributions in the plots to assess the quality of the synthetic data.

In [None]:
from sdv.evaluation.single_table import get_column_plot

# Select a single brainwave channel to visualise (replace with your desired column)
brainwave_column_to_plot = 'AF3'

fig = get_column_plot(
    real_data=real_data, # use your variable name
    synthetic_data=synthetic_data, # use your variable name
    column_name=brainwave_column_to_plot, # edit to match your data
    metadata=metadata)

fig.show()

**Explanation:**

*   We import the `get_column_plot` function.
*   We define `brainwave_column_to_plot` with the name of a single brainwave channel you want to visualise.
*   We call `get_column_plot`, passing in the real and synthetic data, the column name, and the metadata.
*   The resulting plot (`fig`) shows the distribution of values for the selected column in both the real and synthetic data. This helps you visually compare how well the synthetic data captures the distribution of individual columns.



# 7 Saving Synthesizer

Save your trained synthesizer so that you can reload it later to generate more synthetic data, or share it with others.

In [None]:
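# Save the fitted synthesizer to disk
synthesizer.save('my_synthesizer.pkl')

# Reload it later (GaussianCopulaSynthesizer must already be imported)
synthesizer = GaussianCopulaSynthesizer.load('my_synthesizer.pkl')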

# Further Reading:

Adams, C., Pente, P., Lemermeyer, G., & Rockwell, G. (2023). Ethical principles for artificial intelligence in K-12 education. Computers and Education: Artificial Intelligence, 4, 100131. https://doi.org/10.1016/j.caeai.2023.100131

Agencia Española de Protección de Datos (AEPD). (2019). A Guide to Privacy by Design. AEPD. https://www.aepd.es/guides/guide-to-privacy-by-design.pdf

Agencia Española de Protección de Datos (AEPD). (2023, November 2). Synthetic data and data protection | AEPD. https://www.aepd.es/en/prensa-y-comunicacion/blog/synthetic-data-and-data-protection

Clevr | TensorFlow Datasets. (n.d.). TensorFlow. Retrieved October 18, 2024, from https://www.tensorflow.org/datasets/catalog/clevr

CMU, Carnegie Mellon University. (2024). LearnSphere DataShop (Version 11) [Computer software]. https://pslcdatashop.web.cmu.edu/index.jsp?datasets=public

Collins, J., Goel, S., Deng, K., Luthra, A., Xu, L., Gundogdu, E., Zhang, X., Vicente, T. F. Y., Dideriksen, T., Arora, H., Guillaumin, M., & Malik, J. (2022). ABO: Dataset and Benchmarks for Real-World 3D Object Understanding. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21094–21104. https://doi.org/10.1109/CVPR52688.2022.02045

Dankar, F. K., & Ibrahim, M. (2021). Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation. Applied Sciences, 11(5), 2158. https://doi.org/10.3390/app11052158

El Emam, K., Mosquera, L., & Hoptroff, R. (2020). Practical Synthetic Data Generation (1st ed.). O’Reilly Media, Inc. https://www.safaribooksonline.com/complete/auth0oauth2/&state=/library/view//9781492072737/?ar

EU Artificial Intelligence Act. (2023). Retrieved October 6, 2024, from https://artificialintelligenceact.eu/documents/


EDPS. (2024a, October 4). IPEN Webinar 2021 - “Synthetic data: What use cases as a privacy enhancing technology?” | European Data Protection Supervisor. https://www.edps.europa.eu/data-protection/our-work/ipen/ipen-webinar-2021-synthetic-data-what-use-cases-privacy-enhancing

EDPS. (2024b, October 4). Synthetic Data | European Data Protection Supervisor. https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data

FERPA | Protecting Student Privacy. (n.d.). Retrieved November 7, 2024, from https://studentprivacy.ed.gov/ferpa

General Data Protection Regulation (GDPR) Compliance Guidelines. (n.d.). GDPR.Eu. Retrieved October 6, 2024, from https://gdpr.eu/

Gupta, M. (2023). Advances in AI: Employing Deep Generative Models for the Creation of Synthetic Healthcare Datasets to Improve Predictive Analytics. 2023 International Conference on Communication, Security and Artificial Intelligence (ICCSAI), 1026–1030. https://doi.org/10.1109/ICCSAI59793.2023.10421464

Hradec, J., Craglia, M., Di Leo, M., De Nigris, S., Ostlaender, N., & Nicholson, N. (2022, June 13). Multipurpose synthetic population for policy applications. JRC Publications Repository. https://doi.org/10.2760/50072

Johnson, J., Hariharan, B., Maaten, L. van der, Fei-Fei, L., Zitnick, C. L., & Girshick, R. (2016). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (No. arXiv:1612.06890). arXiv. http://arxiv.org/abs/1612.06890

Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Deitke, M., Ehsani, K., Gordon, D., Zhu, Y., Kembhavi, A., Gupta, A., & Farhadi, A. (2022). AI2-THOR: An Interactive 3D Environment for Visual AI (No. arXiv:1712.05474). arXiv. https://doi.org/10.48550/arXiv.1712.05474

Marshall, V., Markham, C., Avramovic, P., Comerford, P., Maple, C., & Szpruch, L. (2023, June 12). Exploring Synthetic Data Validation – Privacy, Utility and Fidelity. FCA Official, 11. https://ico.org.uk/media/for-organisations/documents/4025484/sythetic-data-roundtable-202306.pdf

McCarthy, N., & Fourniol, F. (2020). The role of technology in governance: The example of Privacy Enhancing Technologies. Data & Policy, 2, e8. https://doi.org/10.1017/dap.2020.8

Mahon, J., Quille, K., Mac Namee, B., & Becker, B. A. (2022). A Novel Machine Learning and Artificial Intelligence Course for Secondary School Students. Proceedings of the 53rd ACM Technical Symposium on Computer Science Education V. 2, 1155–1155. https://doi.org/10.1145/3478432.3499073

Mo, K., Zhu, S., Chang, A. X., Yi, L., Tripathi, S., Guibas, L. J., & Su, H. (2018). PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding (No. arXiv:1812.02713). arXiv. http://arxiv.org/abs/1812.02713

National Center for Education Statistics (NCES). (n.d.). Common Education Data Standards (CEDS). https://ceds.ed.gov/reportShow.aspx?Term_x_TopicIds=All&ReportType=Data%20Dictionary%20CEDS%20Info&MapIds=6812&shareId=913d0707-c43c-4527-a703-ec0a414d66f6

Statistics Netherlands (CBS). (2024, October 17). Statistics Netherlands updates its Statistical Disclosure Control Guide [Web page]. Statistics Netherlands. https://www.cbs.nl/en-gb/corporate/2024/42/statistics-netherlands-updates-its-statistical-disclosure-control-guide

Nouri, J., Ebner, M., Ifenthaler, D., Saqr, M., Malmberg, J., Khalil, M., Bruun, J., Viberg, O., González, M. Á. C., Papamitsiou, Z., & Berthelsen, U. D. (2019). Efforts in Europe for Data-Driven Improvement of Education – A Review of Learning Analytics Research in Seven Countries. International Journal of Learning Analytics and Artificial Intelligence for Education (iJAI), 1(1), Article 1. https://doi.org/10.3991/ijai.v1i1.11053

OECD: AI principles. (2019). OECD. Retrieved November 7, 2024, from https://www.oecd.org/en/topics/ai-principles.html

Patki, N., Wedge, R., & Veeramachaneni, K. (2016). The Synthetic Data Vault. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 399–410. https://doi.org/10.1109/DSAA.2016.49

Petrovic, O., Duarte, D., Storms, S., & Herfs, W. (2023). Towards Knowledge-based Generation of Synthetic Data by Taxonomizing Expert Knowledge in Production. In T. Ahram, W. Karwowski, P. Di Bucchianico, R. Taiar, L. Casarotto, & P. Costa (Eds.), Intelligent Human Systems Integration (IHSI 2023): Integrating People and Intelligent Systems (AHFE Open Access, Vol. 69). AHFE International. http://doi.org/10.54941/ahfe1002915

Raghunathan, T. E. (2021). Synthetic Data. Annual Review of Statistics and Its Application, 8, 129–140. https://doi.org/10.1146/annurev-statistics-040720-031848

Ros, G., Sellart, L., Materzynska, J., Vazquez, D., & Lopez, A. M. (2016). The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3234–3243. https://doi.org/10.1109/CVPR.2016.352

Rubin, D. (1993). Discussion: Statistical disclosure limitation. Journal of Official Statistics, 9(2), 461–468. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/discussion-statistical-disclosure-limitation2.pdf

Shaaban, O. (2023). The Impact of Pre-trained Transformer-Based Language Model Use on Student Learning Outcomes in Higher Education—A Mixed-Methods Research Approach with a Case Study of IMC Fachhochschule Krems. https://doi.org/10.13140/RG.2.2.11232.87048

Shafiq, D. A., Marjani, M., Habeeb, R. A. A., & Asirvatham, D. (2022). Student Retention Using Educational Data Mining and Predictive Analytics: A Systematic Literature Review. IEEE Access, 10, 72480–72503. https://doi.org/10.1109/ACCESS.2022.3188767


Simionescu, C., Danubianu, M., & Maciuca, M. S. (2023). How Data Mining and Artificial Intelligence can Contribute to Increasing Academic Performance. Didactica Danubiensis, 3(1), Article 1.

The Artificial Intelligence Act, No. OJ L, 2024/1689, 12.7.2024, European Parliament and Council (2024). https://artificialintelligenceact.eu/wp-content/uploads/2024/02/AIA-Trilogue-Coreper.pdf


van der Sangen, M. (2023, May 23). Synthetic data opens up possibilities in the statistical field | CBS. https://www.cbs.nl/en-gb/corporate/2023/20/synthetic-data-opens-up-possibilities-in-the-statistical-field

Vie, J.-J., Rigaux, T., & Minn, S. (2022). Privacy-Preserving Synthetic Educational Data Generation. In I. Hilliger, P. J. Muñoz-Merino, T. De Laet, A. Ortega-Arranz, & T. Farrell (Eds.), Educating for a New Future: Making Sense of Technology-Enhanced Learning Adoption (Vol. 13450, pp. 393–406). Springer International Publishing. https://doi.org/10.1007/978-3-031-16290-9_29

Williamson, B., Eynon, R., & Potter, J. (2020). Pandemic politics, pedagogies and practices: Digital technologies and distance education during the coronavirus emergency. Learning, Media and Technology, 45(2), 107–114. https://doi.org/10.1080/17439884.2020.1761641



