## Generate Synthetic Data
Many measured quantities are normally distributed, or approximately so. To generate simulated data efficiently, points can be randomly sampled from a seed distribution with a mean of 0.0 and a standard deviation of 1.0. Each sampled point `z` is then multiplied by a specified standard deviation and shifted by a specified mean, giving `mean + sdev * z`. The Python class below generates such simulated data sets.

JDL 8/7/23
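
As a minimal sketch of the scale-and-shift step described in the introduction (not the class itself), the snippet below draws standard normal values with NumPy and rescales them. The function name `scale_and_shift` and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def scale_and_shift(z, mean, sdev):
    """Map standard normal values z to the target mean and standard deviation."""
    return mean + sdev * np.asarray(z)

# Hypothetical fixed seed so the sketch is repeatable
rng = np.random.default_rng(42)
z = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Rescale to the density targets used later in this notebook
x = scale_and_shift(z, mean=1.08, sdev=0.03)
print(round(x.mean(), 3), round(x.std(), 3))
```

With 10,000 points, the printed mean and standard deviation should land close to the targets of 1.08 and 0.03.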

In [1]:
import pandas as pd
import numpy as np

In [18]:
class SyntheticData:
    """
    This class generates normally-distributed synthetic data for simulations
    JDL 8/7/23
    """
    def __init__(self, len_seed_dist):
        
        #Generate a seed distribution of normally distributed data
        self.dist_seed_normal = pd.Series(np.random.normal(loc=0.0, scale=1.0, size=len_seed_dist))
        self.dist_seed_normal.name = "dist_seed_normal"
        
        #Synthetic Data
        self.df_data = None
    
    def scaled_val(self, mean, sdev):
        """
        Pick a random value from the seed distribution, then scale it by
        sdev and shift it by mean
        """
        idx = np.random.choice(self.dist_seed_normal.index)
        return mean + (self.dist_seed_normal[idx] * sdev)
    
    def gen_synthetic_data(self, len_data, mean, sdev, digits):
        """
        Generate a specified number of data points with specified mean and standard deviation
        """
        # Draw one scaled value per row, then round to the requested precision
        vals = [self.scaled_val(mean, sdev) for _ in range(len_data)]
        self.df_data = pd.DataFrame({'vals': vals}, index=range(len_data))
        self.df_data['vals'] = self.df_data['vals'].round(digits)
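
For larger data sets, the per-point calls to `scaled_val` could be replaced by a single vectorized draw. The function below is a sketch under that assumption; `sample_scaled` and its `rng` parameter are hypothetical names, not part of the `SyntheticData` class.

```python
import numpy as np
import pandas as pd

def sample_scaled(seed_dist, len_data, mean, sdev, digits, rng=None):
    """Draw len_data values (with replacement) from seed_dist, then scale, shift, and round."""
    rng = np.random.default_rng() if rng is None else rng
    # One vectorized draw instead of one call per row
    picks = rng.choice(np.asarray(seed_dist), size=len_data, replace=True)
    vals = np.round(mean + picks * sdev, digits)
    return pd.DataFrame({"vals": vals})

# Usage with a hypothetical seeded generator
rng = np.random.default_rng(0)
seed_dist = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000), name="dist_seed_normal")
df = sample_scaled(seed_dist, 20, 1.08, 0.03, 3, rng=rng)
```

Passing an explicit generator keeps runs repeatable, which the class above does not attempt because it uses NumPy's global random state.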

In [19]:
sdata = SyntheticData(10000)

### Example: Generate Simulated Density Data

In [39]:
rho_mean, rho_sd = 1.08, 0.03
n_rho_data = 20

sdata.gen_synthetic_data(n_rho_data, rho_mean, rho_sd, 3)
df_density = sdata.df_data
df_density

|   | vals |
|---:|------:|
| 0 | 1.106 |
| 1 | 1.023 |
| 2 | 1.104 |
| 3 | 1.052 |
| 4 | 1.023 |
| 5 | 1.061 |
| 6 | 1.065 |
| 7 | 1.065 |
| 8 | 1.050 |
| 9 | 1.058 |


In [45]:
print('Mean, StdDevn, Min, Max')
# Select the column by label; positional [0] on a label-indexed Series is deprecated in recent pandas
vals = df_density['vals']
vals.mean(), round(vals.std(), 4), vals.min(), vals.max()

Mean, StdDevn, Min, Max
(1.06945, 0.0298, 1.021, 1.125)
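
To judge whether the sample mean above is plausible for 20 points, one can compare its offset from the target mean with the standard error of the mean, `sdev / sqrt(n)`. The sketch below uses only the values reported in this example; the variable names are illustrative.

```python
import math

rho_mean, rho_sd, n = 1.08, 0.03, 20   # targets from the example
sample_mean = 1.06945                  # sample mean reported in the output above

std_err = rho_sd / math.sqrt(n)        # standard error of the mean
z_score = (sample_mean - rho_mean) / std_err
print(round(std_err, 4), round(z_score, 2))  # 0.0067 -1.57
```

The sample mean sits about 1.6 standard errors below the target, which is not unusual for a sample this small.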