# Create a Low-Fidelity Synthetic Data Set

This notebook was released with the report [Accelerating public policy research
with synthetic data](https://www.adruk.org/news-publications/news-blogs/report-investigates-how-synthetic-data-can-be-used-in-government/).

Dr. Paul Calcraft, Dr. Iorwerth Thomas, Martina Maglicic, Dr. Alex
Sutherland

Please contact iori.thomas@bi.team if you have any queries or suggestions.

## Table of contents
1. [Introduction](#introduction)
2. [Load in your data set](#load_data)
3. [Define relevant functions](#functions)
4. [Generate synthetic data](#generate_data)

## Introduction <a name="introduction"></a>

Synthetic data is artificially generated data that preserves the statistical properties of a data set while containing entirely different records. If used correctly, it represents less disclosure risk than the original data and could therefore be shared with fewer security requirements. This notebook allows you to create a synthetic version of your data set with minimal input from you. All you have to do is load in a data set of your choice and the script will do the rest.

More specifically, the code will do three things: 
1. Extract information from the data set you've uploaded (e.g. the number of rows, whether a column contains numbers, structure etc.) 
2. Categorize each column according to its values (this can be numeric, categorical, datetime, string or NA)
3. Generate new data that imitates key features of the original data set (so-called synthetic data)

The only section that requires your input is [loading in the data set](#load_data), where you need to specify the name of the file you're uploading and the full path to the file. Once you've done that, you can run the script from start to finish and it should output a synthetic data set, which will be saved to your current working directory unless otherwise specified. If you wish, you can change the variable classification, for example if a variable has been categorized as string but it is in fact categorical - however, this is optional. 

As the synthetic data set created with this script preserves statistical properties incl. variable types, frequency of missing values and data set structure, but does not capture relationships between variables, the output data can be used for a variety of tasks. For example, it can be used as test data for analysis code (prior to the receipt of the original data), to train practitioners in how to use or analyse administrative data, for exploratory analysis, or to familiarise any interested party with the data set and the questions it can answer. However, we recommend that you thoroughly inspect the resulting data set and check for (accidental) correlations prior to publishing it to ensure no sensitive information is being passed on.

## 1. Load in data set <a name="load_data"></a>

**This section requires your input.**

We start by loading in necessary packages. For simplicity, we have limited dependencies to the minimum possible.

### Automated package installation (optional)

In [None]:
#install any necessary packages
"""import sys
!{sys.executable} -m pip install --user --upgrade setuptools
!{sys.executable} -m pip install --user numpy==1.19
!{sys.executable} -m pip install --user 'pandas>=0.24'

from packaging import version
import pkg_resources

import pandas as pd

if version.parse(pd.__version__) < version.parse("1.0.5"):
    # require older numpy if we have old pandas
    del pd
    pkg_resources.require("numpy==1.19.0")
    pkg_resources.require("pandas>=0.23") """


### Import packages

In [None]:
# load necessary packages
import numpy as np
import pandas as pd
import datetime
import sys
import os
import random
import gc
from datetime import date

In order to load in your data set, please specify the file path. For example:

file_path = "C:\\Users\\FirstName.LastName\\Downloads\\General\\data.csv"

In the case that you keep the script in the same working directory as your data, you only need to specify the file name and extension. For example:

file_path = "data.csv"

Note that the synthetic data set is saved under the name of your input file (without its extension) with "_synthetic" added, e.g. "data_synthetic.csv". If you read your data from a pipeline instead of a file, you need to provide a file name **without extension** yourself (see "Using a data pipeline" below).
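
As a minimal sketch of how the output name relates to the input path, the snippet below mirrors the `os.path` calls used in the final section of this notebook. The example path is hypothetical, and `ntpath` (the Windows flavour of `os.path`) is used only so that the sketch behaves the same on any operating system.

```python
import ntpath  # Windows flavour of os.path; importable on any operating system

# Hypothetical example path. The r prefix makes this a raw string, so the
# backslashes are kept literally instead of being read as escape sequences.
file_path = r"C:\Users\FirstName.LastName\Downloads\General\data.csv"

base = ntpath.basename(file_path)       # file name with extension
file_name = ntpath.splitext(base)[0]    # file name without extension
output_name = file_name + "_synthetic.csv"
print(output_name)
```

Writing the path as a raw string (or with doubled backslashes) avoids Python interpreting sequences such as `\U` as escape codes.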

Note that apart from Microsoft Excel files, only two dimensional data is currently supported for automatic upload and each column has to be a separate variable (as opposed to variables being contained in rows).

**Please specify the file path below.**

In [None]:
# 1) put in your full file path here, e.g. "C:\\Users\\FirstName.LastName\\Downloads\\General\\data.csv" 

file_path = "" 

In [None]:
MODE = "normal"

# This detects if you're using a .sav file -- if you have python3 it will attempt to preserve metadata from the file
if (file_path.endswith('sav') and sys.version_info[0] == 3):
    import pyreadstat
    MODE = "sav_file"


Once you have specified the file path and name above, you can run the entire script, which will automatically load the file, generate a synthetic data set and save it in your working directory.

### Using a data pipeline

Alternatively, if you are reading in the data from a database pipeline, uncomment the following: 


In [None]:
#MODE = "pipeline"

Now uncomment the following lines and replace LIBRARY with the python library you use to access the pipeline, and PIPELINE_QUERY with the appropriate function for accessing the pipeline.  You will also need to define the name of the output file without an extension here.

In [None]:
#import LIBRARY
#original_data = PIPELINE_QUERY
#file_name = "OUTPUT_FILE_NAME"
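
To make the placeholders above more concrete, here is one possible way to fill them in. It is only a sketch: it assumes the pipeline is a SQL database reachable through Python's built-in `sqlite3` module, and the table name `survey`, its columns and the output name are all hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for a real pipeline
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE survey (age INTEGER, county TEXT)")
connection.executemany(
    "INSERT INTO survey VALUES (?, ?)",
    [(34, "Kent"), (51, "Devon"), (27, "Kent")],
)
connection.commit()

MODE = "pipeline"
# The query result is loaded straight into a pandas dataframe
original_data = pd.read_sql_query("SELECT * FROM survey", connection)
file_name = "survey"  # output would be saved as survey_synthetic.csv
connection.close()
```

Whatever library you use, the rest of the notebook only needs `original_data` to be a pandas dataframe and `file_name` to be a string.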

### Modifying the output appearance

In [None]:
# Specify how you want null or NaN values in your data to be represented in the synthetic data output
null_string = ""

# Specify precision of real numerical columns
numerical_precision = 2
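
The effect of these two settings can be illustrated on a toy column. This is a sketch rather than the notebook's own code path: the values are made up, and `"NA"` is used instead of the default empty string purely so that the replacement is visible.

```python
import numpy as np
import pandas as pd

null_string = "NA"        # hypothetical choice; the notebook default is ""
numerical_precision = 2

column = pd.Series([3.14159, np.nan, 2.71828])  # made-up values

# Round real numbers to the requested number of decimals
rounded = column.round(numerical_precision)
# Replace missing values with the chosen placeholder
as_output = rounded.astype(object).where(rounded.notna(), null_string)
print(as_output.tolist())
```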


**The sections below do not require your input. Edit only if you would like to customize features.**

## 2. Define relevant functions <a name="functions"></a>

This section defines all of the custom-made functions used later in the script to produce a synthetic data set (incl. the function used to classify variables and generate new data). It's worth reading this section if you want to get a deeper understanding of how the code works and what it does. 

### read_data()
This function is used to read in your data. It detects the type of data you are trying to load (using the extension) and then picks out the correct pandas function to read in your data as a pandas dataframe. Currently 8 data types are supported: 'csv', 'txt', 'xlsx', 'xls', 'sas7bdat', 'sav', 'dta' and 'pkl'. Note that this function assumes the first row of data is a header.

In [None]:
def read_data(x):
    if (x.endswith(('csv', 'txt'))):
        return pd.read_csv(x) 
    elif (x.endswith(('xlsx', 'xls'))):
        dictionary = pd.read_excel(x, sheet_name = None) 
        if (len(dictionary.keys()) == 1):
            name_of_sheet = list(dictionary.keys())
            name_of_sheet = name_of_sheet[0]
            simple_data_frame = dictionary[name_of_sheet]
            return simple_data_frame
        else:
            return dictionary
    elif (x.endswith('sas7bdat')):
        return pd.read_sas(x) 
    elif (x.endswith('sav')):
        return pd.read_spss(x) 
    elif (x.endswith('dta')):
        return pd.read_stata(x)  
    elif (x.endswith('pkl')):
        return pd.read_pickle(x) 
    else:
        raise Exception("Sorry, file type not supported. Try converting to csv, xlsx, txt or pkl.")


### check_if_datetime()

This function is used in the variable classification process. To check whether a variable contains only dates or times, we try to convert the column to datetime format using pandas, and if this yields no errors, we return 'True' else 'False'.

In [None]:
def check_if_datetime(x):
    try:
        pd.to_datetime(x, format='mixed', dayfirst=True)
    except (RuntimeError, TypeError, NameError, IOError, ValueError):
        return False
    else:
        return True

### check_if_numeric()

To identify numeric variables, we try to convert columns to numeric using pandas, and if this yields no errors, we return 'True' else 'False'.

In [None]:
# Define function
def check_if_numeric(x):
    try:
        pd.to_numeric(x)
    except (RuntimeError, TypeError, NameError, IOError, ValueError):
        return False
    else:
        return True
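
The following sketch shows the behaviour this check relies on: `pd.to_numeric` succeeds when every value can be parsed and raises an error as soon as one value cannot. The helper name `converts_to_numeric` and the example values are hypothetical.

```python
import pandas as pd

def converts_to_numeric(column):
    """Return True if every value in the column can be parsed as a number."""
    try:
        pd.to_numeric(column)
    except (TypeError, ValueError):
        return False
    return True

clean = pd.Series(["1", "2.5", "-3"])
with_label = pd.Series(["1", "2.5", "missing"])

print(converts_to_numeric(clean))       # every value parses
print(converts_to_numeric(with_label))  # 'missing' cannot be parsed
```

This is why a column that uses a label such as 'missing' for NAs needs the separate "predominantly numeric" rule described in the next section.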

### identify_variable_type()

With this function, we classify variables according to type. We check whether a variable is NA (i.e. empty), categorical, numeric, datetime or string in the specified order. The function stops evaluating and assigns a type as soon as one of the below conditions is found to be true.

- **NA** columns are columns that are empty, i.e. only have NA or null values.
- **Categorical** columns have fewer than n unique values, where n is 100 if the column has at least 300 non-missing values, and 30% of the column length otherwise.
- **Numeric** columns only have numeric values and values associated with numbers (e.g. a period or a minus), or are predominantly numeric but have very few (at most 10) unique values that do not fit the pattern. The latter is used to capture cases where NAs are represented by strings such as 'missing'.
- **Datetime** columns are those columns that can be parsed into a datetime (meaning they are represented in a commonly accepted date format), or where most values can be parsed into datetime with the exception of a low number (at most 10) of unique values. Again, this is used to capture cases where NAs are represented by strings such as 'missing'.
- Everything else is classified as **string**.
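
The unique-value rule for categorical columns can be sketched in plain Python. The function name `looks_categorical` is hypothetical; the thresholds are the ones used in `identify_variable_type()` below.

```python
def looks_categorical(n_rows, n_non_missing, n_unique):
    """Restate the unique-value rule used for the 'categorical' type."""
    if n_non_missing >= 300:
        # Larger columns: fewer than 100 distinct values
        return n_unique < 100
    # Smaller columns: fewer distinct values than 30% of the column length
    return n_unique < n_rows * 0.3

print(looks_categorical(n_rows=1000, n_non_missing=950, n_unique=12))
print(looks_categorical(n_rows=1000, n_non_missing=950, n_unique=400))
print(looks_categorical(n_rows=50, n_non_missing=50, n_unique=20))
```

Because this check runs before the numeric check, a numeric column with only a handful of distinct values (e.g. a 1 to 5 rating) is classified as categorical.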

In [None]:
def identify_variable_type(x):
    # Is the column empty? If so, it will be classified as 'NA':
    if (x.dropna().empty == True):
        return "NA"
    # Is the variable categorical? We check the number of unique values:
    if ((x.dropna().shape[0] >= 300 and x.dropna().nunique()<100) or (x.dropna().nunique()<len(x)*0.3 and x.dropna().shape[0] < 300)):
        return "categorical"
    # If no numbers are present, we classify it as a string:
    elif(x.astype(str).str.contains(r"[0-9]").any() == False):
        return "string"
    # We then check if it's numeric, or predominantly numeric with some exceptions:
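    elif(check_if_numeric(x) == True): 
        # Coerce first so that numbers stored as text can be compared with integers
        x_numeric = pd.to_numeric(x)
        # Values that all lie between 19200101 and 20260101 are treated as dates written as YYYYMMDD
        if x_numeric.min() > 19200101 and x_numeric.max() < 20260101:
            return "datetime"
        return "numeric"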
    elif(x.astype(str).str.contains(r"[a-zA-Z]").any() == True and 
         x[x.astype(str).str.contains(r"[a-zA-Z]")].nunique()<11 and 
         check_if_numeric(x[x.astype(str).str.contains(r"[^a-zA-Z]")]) == True):
        return "numeric"
    # next, we check if it's a date or a time, or predominantly datetime with some exceptions:
    elif(check_if_datetime(x) == True):
        return "datetime"
    elif(x.astype(str)[x.astype(str).str.contains(r"[0-9]") == False].nunique() < 11 and
         check_if_datetime(x[x.astype(str).str.contains(r"[0-9]") == True]) == True):
        return "datetime"
    # If none of the above apply, we classify the variable as string:
    else:
        return "string"

### prepend(list, str)
This is a simple helper that puts a string (here an underscore) in front of each variable type, so that the result can be appended to the column names.

In [None]:
def prepend(list, str): 
    str += '{0}'
    list = [str.format(i) for i in list] 
    return(list) 
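
As a usage sketch, the snippet below shows how the prefixed types end up at the end of the column names. It uses a hypothetical equivalent helper, `prepend_to_each`, whose parameter names do not shadow the Python built-ins `list` and `str`; the column names and types are made up.

```python
def prepend_to_each(items, prefix):
    """Put the same prefix in front of every item in a list."""
    return [prefix + str(item) for item in items]

column_names = ["age", "county"]            # hypothetical column names
column_types = ["numeric", "categorical"]   # hypothetical classification

suffixes = prepend_to_each(column_types, "_")
new_names = [name + suffix for name, suffix in zip(column_names, suffixes)]
print(new_names)
```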

### generate_datetime(min_time, max_time)

This function is used in the process of generating synthetic datetime data. For simplicity, we assume a uniform distribution and randomly draw a timepoint that lies between the earliest and latest time recorded.

In [None]:
def generate_datetime(min_time, max_time, to_floor = None):
    start = pd.to_datetime(min_time)
    end = pd.to_datetime(max_time)
    random_date = start + (end - start) * random.random()
    if to_floor:
        return random_date.floor(to_floor)
    return random_date
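
The draw itself, and the effect of the optional `to_floor` argument, can be sketched as follows. The start and end points are made up, and the seed is set only so that the sketch is repeatable.

```python
import random
import pandas as pd

start = pd.to_datetime("2020-01-01 00:00:00")  # hypothetical earliest time
end = pd.to_datetime("2020-12-31 23:59:59")    # hypothetical latest time

random.seed(0)  # seeded only so the sketch is repeatable
# Move a random fraction of the way from start to end
raw_draw = start + (end - start) * random.random()
# Flooring to whole seconds drops any sub-second precision
floored_draw = raw_draw.floor("s")
print(raw_draw, floored_draw)
```

Flooring matters when the original column only records whole seconds, as the raw draw would otherwise introduce fractions of a second that never occur in the source data.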


### paste0(string, values)

This function helps us create new column names and imitates the R function 'paste0' (essentially appending values to strings).

In [None]:
def paste0(string, values):
    texts = [string + str(num1) for num1 in values]
    return texts

### create_synthetic_data()

This is the function that generates synthetic data. It assumes that variables have already been classified by type, and that these types have been appended to the column names. The function uses this information to create a similar data set; similar here means that variable type, values and/or patterns of individual variables will be preserved, but relationships between variables will not. Depending on variable type, this will entail a different procedure:
- **NA**: We return an empty column.
- **Numeric**: For simplicity, we assume that data points are normally distributed. We extract the mean and standard deviation from the variable and use this to simulate new data.
- **Categorical**: We cross-tabulate categories and their associated frequencies, and convert these frequencies into probabilities that form a distribution from which we simulate new data (e.g. if 'yes' and 'no' each appear 50% of the time in the existing column, then 'yes' and 'no' have a 50% chance of being drawn at each round as we generate new data points for the synthetic column).
- **Datetime**: For simplicity, we assume that data points follow a uniform distribution and randomly draw time points that lie between the earliest and latest time recorded.
- **String**: There are two possible outcomes depending on whether we detect a pattern. We assume a pattern is present if entries have roughly the same length, as a string with a pattern is likely to have similar lengths across entries (this will be the case for e.g. post codes and IP addresses, but not for case notes or customer reviews). If there is no pattern, we use a placeholder (‘sample text’) to create a new column. If a pattern is detected, we split each string into individual symbols and spread them across multiple columns by position (so the first symbol will be in the first column, the second symbol in the second column etc.). For each position, we compute the probability of each symbol occurring and then draw from the resulting distribution. This is done as many times as there are rows in the original data set. Finally, we merge the resulting symbols (previously split by position) into a string and return this new column as the final result.
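
The position-by-position sampling used for patterned strings can be sketched in plain Python. This is an illustration of the idea rather than the pandas implementation below; the function name `synthesise_patterned_strings` and the example values are hypothetical.

```python
import random
from collections import Counter

def synthesise_patterned_strings(values, n_rows, rng):
    """Sample each character position independently from its observed frequencies."""
    max_len = max(len(v) for v in values)
    new_values = [""] * n_rows
    for position in range(max_len):
        # Count the symbols seen at this position (shorter strings contribute nothing)
        counts = Counter(v[position] for v in values if len(v) > position)
        symbols = list(counts.keys())
        weights = list(counts.values())
        # Draw one symbol per synthetic row, weighted by observed frequency
        drawn = rng.choices(symbols, weights=weights, k=n_rows)
        new_values = [old + new for old, new in zip(new_values, drawn)]
    return new_values

post_codes = ["AB1 2CD", "AB3 4CE", "XY1 2CD"]  # made-up values
rng = random.Random(42)  # seeded only so the sketch is repeatable
synthetic = synthesise_patterned_strings(post_codes, n_rows=5, rng=rng)
print(synthetic)
```

Because each position is sampled independently, the output keeps the overall shape of the values (here, a space in the fourth position) but can combine symbols in ways that never occurred together in the original data.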


In [None]:
def create_synthetic_data(x):
    # 1. Extract key information about the variable
    nrow = x.shape[0]
    nrow_NA = x.isnull().sum()
    min_value = round(nrow_NA * 0.7)
    max_value = nrow if nrow_NA * 1.3 > nrow else int(round(nrow_NA * 1.3))
    nrow_NA = np.random.uniform(low = min_value, high = max_value)    
    nrow_non_NA = int(round(nrow-nrow_NA))
    # 2. Define procedure for empty columns (return an empty column)
    if (x.name == None or x.name.endswith('NA')):
        new_col = [np.nan for i in range(nrow)]
        new_col = pd.Series(new_col)
        return new_col
    # 1. Define procedure for categorical variables
    elif (x.name.endswith('categorical')):
        # a) cross tab data (%)
        cross_tab = x.value_counts(dropna = False, normalize=True)
        # b) extract categories and frequencies
        values = cross_tab.axes[0].tolist()
        probs = cross_tab.tolist()
        # c) create new column using number of rows, categories and frequencies as input
        new_col = np.random.choice(values, nrow, p=probs)
        # d) turn array into pandas series and ensure NAs are displayed correctly
        new_col = pd.Series(new_col).replace('nan', np.nan)
        return new_col
    # 2. Define procedure for numeric variables
    elif(x.name.endswith('numeric')):
         
        def random_numerical_selection(x_mean, x_sd, nrow_non_NA, nrow):
            probs=[nrow_non_NA/nrow, (nrow-nrow_non_NA)/nrow]
            values=[np.random.normal(x_mean,x_sd), np.nan]
            return np.random.choice(values, p=probs)
        
        # a) coerce to numeric 
        x = pd.to_numeric(x, errors = 'coerce')
        is_integer = True if str(x.dtypes) == 'int64' or all(y.is_integer() or pd.isnull(y) for y in x) else False
        # b) get mean and standard deviation
        x_mean = x.mean()
        x_sd = np.std(x)
        # c) simulate data using normal distribution and proportion of NaNs
        not_NAN_vals = nrow-x.isnull().sum()
        new_col = [random_numerical_selection(x_mean, x_sd, nrow_non_NA, nrow) for _ in range(nrow)]
        #new_col = np.random.normal(x_mean, x_sd, nrow_non_NA)
        # d) turn array into pandas series and ensure NAs are displayed correctly
        #new_col = np.pad(new_col, (0,int(nrow_NA)), "constant", constant_values=(np.nan,))
        new_col = pd.Series(new_col).replace('nan', np.nan)
        # e) ensure original format is respected (integer vs float)
        if is_integer:
            new_col = new_col.round(decimals=0)
        #else:
        #    new_col = round(new_col, num_precision)
        # f) check if we have positive or negative values only (and correct if necessary)
        nrow_negative_values = x[x < 0].dropna().shape[0] 
        nrow_positive_values = x[x > 0].dropna().shape[0] 
        if (nrow_negative_values > 0 and nrow_positive_values == 0):
            new_col.loc[(new_col > 0)] = x.max()
        if (nrow_positive_values > 0 and nrow_negative_values == 0):
            new_col.loc[(new_col < 0)] = x.min()
        
        #e) ensure original format is respected (integer vs float)  -- put here or the rounding isn't imposed on some numbers
        if (is_integer == True):
            new_col = new_col.astype(pd.Int64Dtype(), errors='ignore')
        else:
            new_col = round(new_col, numerical_precision)
        return new_col
    # 3. Define procedure for datetime variables
    elif(x.name.endswith('datetime')):
        #a) coerce to datetime
        x = pd.to_datetime(x.apply(str), errors = 'coerce')
        #b) check whether it's date, time or datetime, and has nanoseconds
        just_time = x.dt.time
        no_ns = x.astype('datetime64[s]')
        just_seconds = just_time.equals(no_ns.dt.time)
        to_floor = 's' if just_seconds else None
        just_time = just_time.dropna().astype(str)
        # (if no times were in the original column, pandas will set it to midnight)
        nrows_with_times = just_time[(just_time != '00:00:00')].shape[0] # if this is more than 0, then we have (some) times
        just_date = x.dt.date
        # (if no dates were in the original column, pandas will attach today's date)
        todays_date = date.today() 
        todays_date = '{:%Y-%m-%d}'.format(todays_date)
        just_date = just_date.dropna().astype(str)
        nrows_with_dates = just_date[(just_date != todays_date)].shape[0] # if this is more than 0, then we have (some) dates
        #c) get earliest and latest time points (Series.min/max skip missing values)
        t1 = x.min()
        t2 = x.max()
        #d) generate new data using a uniform distribution
        new_col = [generate_datetime(min_time = t1, max_time = t2, to_floor=to_floor) for i in range(nrow_non_NA)]
        new_col = pd.Series(new_col)
        # If this condition is met, we have datetime format:
        if (nrows_with_times > 0 and nrows_with_dates > 0): 
            return new_col.replace('nan', np.nan)
        # If this condition is met, we have just times
        elif (nrows_with_times > 0 and nrows_with_dates == 0): 
            new_col = new_col.dt.time
            return new_col.replace('nan', np.nan)
        # Else, we have just dates:
        else:  
            new_col = new_col.dt.date
            return new_col.replace('nan', np.nan)
    # 4. Define procedure for string variables
    elif(x.name.endswith('string')):
        # 1) Compute essential information about the variable
        x = x.astype(str)
        av_character_length = x.dropna().apply(len).mean()
        sd_character_length = x.dropna().apply(len).std()
        max_character_length = x.dropna().apply(len).max()
        min_character_length = x.dropna().apply(len).min()
        # 2) Define a rule for determining whether a pattern exists 
        if (sd_character_length > 0.2*av_character_length):
            # a) Return a placeholder for strings without patterns
            def str_of_len(random_len):
                unit = "sample text "
                output = ''
                while len(output) < random_len:
                    output += unit
                # Trim the repeated placeholder to exactly random_len characters
                return output[:random_len]
            
            def random_length_str(max_len, min_len):
                return str_of_len(np.random.default_rng().integers(min_len, max_len+1))
                                           
            #new_col = ['sample text' for i in range(nrow)]
            new_col = [random_length_str(max_character_length,min_character_length) for i in range(nrow)]
            new_col = pd.Series(new_col).replace('nan', np.nan)
            return new_col
        # If strings are similar in length, we assume that they have some meaningful pattern:
        else: 
            # 3) Split the string column into multiple columns (one symbol per column)
            x = x.dropna().apply(list)
            new_col_names = paste0('position', range(1,(max_character_length+1)))
            x_split = pd.DataFrame(x.tolist(), columns=[new_col_names])
            # 4) Cross tabulate each column and extract values and associated frequencies
            frequencies = x_split.apply(pd.Series.value_counts, normalize=True)
            frequencies = frequencies.fillna(0)
            # 5) Simulate data based on the probability of a certain value showing up in a certain position
            result_df = pd.DataFrame()
            # This for loop loops through all of the columns and draws an nrow number of values from each
            for i in range(0, max_character_length): 
                result_col = np.random.choice(frequencies.axes[0].tolist(), nrow_non_NA, p=frequencies.iloc[:, i].tolist())
                result_col = pd.Series(result_col)
                result_df = pd.concat([result_df, result_col], axis=1, ignore_index = True)
            # 6) Merge individual symbols into one string
            result_df['all_values'] = result_df[result_df.columns[:]].apply(
                lambda x: ''.join(x.dropna().astype(str)), axis=1)
            # 7) Return result as a pandas series object
            new_col = result_df['all_values'] 
            return new_col.replace('nan', np.nan)
    # 5. If no type applies, return error message
    else:
        raise Exception("Error: No variable type specified. Variable types should be specified at the end of the variable name, e.g. 'gender_categorical'.")
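
The numeric branch combines two ideas: the number of missing values is jittered by roughly 30% around the original count, and the remaining values are drawn from a normal distribution fitted to the column. The sketch below restates this in vectorised NumPy. It is an approximation of the logic above, not a drop-in replacement, and the example values and the seed are made up.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

original = np.array([12.0, 15.5, np.nan, 9.0, np.nan, 14.2, 11.1, 13.3, 10.4, 12.9])
n_rows = original.size
n_missing = int(np.isnan(original).sum())

# Jitter the number of missing values within roughly +/- 30% of the original count
low = round(n_missing * 0.7)
high = min(n_rows, round(n_missing * 1.3))
share_missing = rng.uniform(low, high) / n_rows

# Draw from a normal distribution fitted to the observed values,
# then blank out each entry with probability share_missing
mean, sd = np.nanmean(original), np.nanstd(original)
synthetic = rng.normal(mean, sd, size=n_rows)
synthetic[rng.random(n_rows) < share_missing] = np.nan
print(synthetic)
```

The jitter means the synthetic column does not reveal the exact number of missing values in the original column.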
    

## 3. Generate Synthetic Data <a name="generate_data"></a>

This section uses the data set and functions specified above in order to generate synthetic data. If you'd like a detailed overview of what happens in each step, please take a look at the previous section, because this is where the features of individual functions are explained in depth. 

We start by loading in the specified data set as a pandas dataframe.

Note that this function tries to infer column names using the first row of the data. If these are not column names (or if you'd like to skip some rows), please write your own code to load in your data. Otherwise, you might accidentally reveal an observation from the original data set that has been preserved as column names when publishing the synthetic data set.
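
If your file has no header row, or starts with lines that should be skipped, loading code along the following lines may be a starting point. It is a sketch assuming a CSV file; the file contents and column names are hypothetical, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical file contents: two lines of notes, then data without a header row
raw_text = "exported 2021-05-01\ninternal use only\n34,Kent\n51,Devon\n"

original_data = pd.read_csv(
    io.StringIO(raw_text),    # replace with your file path
    skiprows=2,               # skip the two lines of notes
    header=None,              # the first remaining row is data, not column names
    names=["age", "county"],  # supply neutral column names yourself
)
print(original_data)
```

With `header=None`, the first data row stays in the data set instead of becoming the column names, so it is replaced by synthetic values like every other row.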

In [None]:
if MODE=='normal':
    original_data = read_data(file_path)
elif MODE=='sav_file':
    original_data, metadata = pyreadstat.read_sav(file_path)

original_data

In case the data set is a Microsoft Excel file with multiple sheets, the data will have been imported as a **dictionary** that holds all of the sheets as separate data sets. This means we have to loop our code through the sheets. If this is the case, we run the below code (it automatically checks if the data is in dictionary format) and stop executing the script as the code that follows has been written for two dimensional data. You might see a warning 'SystemExit: 0', but this just means the code is working as expected, and your synthetic data set has already been saved in your working directory.

In [None]:
# check if we have a data set in dictionary format
if (type(original_data) == dict):
    # create excel writer to save output
    base = os.path.basename(file_path)
    file_name = os.path.splitext(base)[0] 
    writer = pd.ExcelWriter(file_name + '_synthetic' + '.xlsx') 
    # loop code through each dictionary item (sheet in spreadsheet)
    for sheet in original_data:
        # save original col names
        original_sheet = original_data[sheet]
        original_column_names = list(original_sheet.columns)
        original_column_names = list(map(str, original_column_names))
        #trim trailing white space
        original_sheet = original_sheet.applymap(lambda x: x.strip() if isinstance(x, str) else x)
        #3) Classify and rename columns
        column_types = original_sheet.apply(identify_variable_type)
        column_types = prepend(column_types, '_')
        new_column_names = [i + j for i, j in zip(original_column_names, column_types)] 
        original_sheet.columns = new_column_names
        # Generate new data
        synthetic_data = original_sheet.apply(create_synthetic_data)
        # Rename columns back to original names
        synthetic_data.columns = original_column_names
        # Save sheet to file
        synthetic_data.to_excel(writer, sheet_name = sheet, index = False)
    # Save excel file once all sheets are done    
    writer.close()
    print('Success! Your synthetic data set has been saved as ' + file_name + '_synthetic' + '.xlsx' + ' in your working directory.')
    # Stop executing script
    sys.exit(0)

If the data is **not in dictionary format**, we run code below instead.

We start by saving the original column names. The next step is trimming trailing white space, as this may otherwise prevent correct variable classification. We subsequently classify variables according to type (NA, numeric, categorical, string or datetime) and append the type to the original column name. This will later be used to pass information about the variable type to the next step. The reason variables are classified and new data generated in two separate steps is to allow for correction and customization of variable types. At the end of this code chunk, you'll see how variables have been classified and you can check whether you're happy with it. 

In [None]:
# save original column names
original_column_names = list(original_data.columns)

# clean column names
#original_data.columns = original_data.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

#trim trailing white space
for column in original_data.columns: # written to optimise memory
    column_data = original_data[[column]].copy(deep=True)
    column_data = column_data.applymap(lambda x: x.strip() if isinstance(x, str) else x)
    #cols = original_data.select_dtypes(['object']).columns
    if column_data[column].dtype == object:
        column_data = column_data.astype(str).apply(lambda x: x.str.strip())
    original_data[column]=column_data
    del [[column_data]] #delete working column from memory
    gc.collect() #garbage collect to make sure it's gone

# Classify and rename columns
column_types = original_data.apply(identify_variable_type)
column_types = prepend(column_types, '_')
new_column_names = [i + j for i, j in zip(list(original_data.columns), column_types)] 
original_data.columns = new_column_names

new_column_names

**Correction of Variable Classification (optional)**

Please check the variable classification shown above. If you are unhappy and would like to change the classification, you can do so below. Simply remove the hashtags and insert the right column names. 

"column_to_be_renamed" refers to the current name of the column you'd like to rename. This should consist of the original column name with an underscore and a variable type added at the end, e.g.: "county_categorical"

"new_column_name" refers to the new column name. Simply replace the wrong variable type with the correct one. Possible variable types are 'string', 'categorical', 'numeric', 'NA' or 'datetime'. For example, we could change "county_categorical" into "county_string".

In [None]:
# original_data = original_data.rename({"column_to_be_renamed1":"new_column_name1","column_to_be_renamed2":"new_column_name2"}, axis='columns') 
# original_data.columns
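
For illustration, this is what the correction looks like on a toy data set. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data set after automatic classification
original_data = pd.DataFrame({
    "county_categorical": ["Kent", "Devon"],
    "age_numeric": [34, 51],
})

# Treat 'county' as a string variable instead of a categorical one
original_data = original_data.rename(
    {"county_categorical": "county_string"}, axis="columns"
)
print(list(original_data.columns))
```

Only the suffix changes; the part of the name before the final underscore has to stay identical so that the original column names can be restored afterwards.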

In a final step, we generate synthetic data. This data set will have the same column names, structure and number of rows as your original data set. The frequency of NA values will be preserved (albeit with the introduction of some noise). Values found in columns will be largely similar (e.g. similar structure of string values, similar frequency of categories for categorical values etc). The new data set will have '_synthetic' appended to its file name and will be saved as a .csv file in your current working directory.

In [None]:
# Generate new data: here we do everything column by column and overwrite original dataframe to save memory
for column in original_data.columns:
    column_data = original_data[[column]].copy(deep=True)
    original_data[column] = column_data.apply(create_synthetic_data)
    del [[column_data]] # delete column data to save memory
    gc.collect() #force garbage collection to clear column data from memory
    
# Rename columns
original_data.columns = original_column_names

# Replace nulls/NaNs with string
for column in original_data.columns:
    if original_data[column].dtype != pd.Int64Dtype():
        original_data[column] = original_data[column].replace(np.nan, null_string)

# Work out file_name from path (if not using a pipeline)
if MODE != 'pipeline':
    base = os.path.basename(file_path)
    file_name = os.path.splitext(base)[0] 

data_location = file_name + '_synthetic'

# save to csv
original_data.to_csv(data_location + '.csv', index = False)
# save to sav format if original is .sav
if MODE == 'sav_file':
    pyreadstat.write_sav(
        original_data,
        data_location + '.sav',
        file_label=metadata.file_label,
        column_labels=metadata.column_labels,
        variable_value_labels=metadata.variable_value_labels,
        missing_ranges=metadata.missing_ranges,
        variable_display_width=metadata.variable_display_width,
        variable_measure=metadata.variable_measure,
    )

original_data.head()

**Please thoroughly inspect your data before sharing it publicly.** While the code has been written to minimise the likelihood of sharing sensitive data, we cannot guarantee this will always be the case. Therefore, we recommend checking for (accidental) correlations between variables, observations from the original data set that might have been reproduced in the synthetic data set by chance, or any other revealing information.
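
Two of these checks can be sketched as follows. The data frames are small made-up stand-ins for your original and synthetic data, and the snippet assumes you still have the original data loaded under a separate name (the notebook itself overwrites `original_data` to save memory).

```python
import pandas as pd

# Hypothetical stand-ins for the original and synthetic data sets
original = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "income": [21000, 38000, 19500, 30250],
})
synthetic = pd.DataFrame({
    "age": [29, 51, 40, 33],
    "income": [25100, 38000, 17250, 31900],
})

# 1) Pairwise correlations between numeric columns of the synthetic data
correlations = synthetic.corr()

# 2) Synthetic rows that also appear, value for value, in the original data
exact_matches = synthetic.merge(original, how="inner", on=list(original.columns))

print(correlations)
print(len(exact_matches), "synthetic row(s) match an original row exactly")
```

Here one synthetic row happens to reproduce an original record, which is the kind of chance match worth reviewing before the data set is shared.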

# Licence Information

MIT License

Copyright (c) 2022 Behavioural Insights Team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.