# Synthetic Data Generator

You don't have to run this code; we provide it for completeness.  The high-dimensional synthetic data $\mathbf{C}$ and its non-linear transformation $\mathbf{D}$ are available at https://github.com/nitishbahadur/book_chapter/tree/master/data/synthetic_data/input

How did we create the synthetic dataset?

We created a random matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$, where
 - $\mathbf{A} \in \mathbb{R}^{5000\times5}$
 - $\mathbf{B} \in \mathbb{R}^{5\times784}$
 
We then apply non-linear transformations to $\mathbf{C}$ to create $\mathbf{D} = g(\mathbf{C})$, where $\mathbf{D} \in \mathbb{R}^{5000\times784}$ but has a known underlying dimension of $5$.
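As a quick sanity check on the construction above, the product of a $5000\times5$ matrix and a $5\times784$ matrix can have rank at most $5$. The sketch below confirms this numerically with `np.linalg.matrix_rank`. The seeded generator is an assumption made only for reproducibility; the original notebook does not seed its random numbers.

```python
import numpy as np

# Seed chosen only so the sketch is reproducible (assumption, not in the original).
rng = np.random.default_rng(0)

A = rng.random((5000, 5))   # 5000 samples, 5 latent factors
B = rng.random((5, 784))    # maps the 5 factors to 784 observed features
C = A @ B                   # same product as np.matmul(A, B)

# The rank of C is bounded by the inner dimension of the product.
rank_C = np.linalg.matrix_rank(C)
print(C.shape, rank_C)
```

Although $\mathbf{C}$ has 784 columns, every column is a linear combination of the same 5 columns of $\mathbf{A}$, which is why its linear dimension is 5.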

In [23]:
import numpy as np
import time
import os
import random

In [24]:
A = np.random.rand(5000,5)
B = np.random.rand(5, 784)
C = np.matmul(A, B)

Apply non-linear transformation

We create a matrix $\mathbf{D} \in \mathbb{R}^{5000\times784}$ column by column: each column of $\mathbf{D}$ is built by applying non-linear functions to randomly selected columns of $\mathbf{C} \in \mathbb{R}^{5000\times784}$ and adding the results together.

Select the catalog of non-linear functions to apply to the columns of $\mathbf{C}$.  Additional non-linear functions can be added to this dictionary.

In [25]:
funcdict = {
    'power' : np.power,
    'exp' : np.exp,
    'reciprocal' : np.reciprocal,
    'exp2' : np.exp2
}
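Extending the catalog only requires adding a new key that maps to a NumPy function. The sketch below repeats the dictionary so that it runs on its own; the two added entries (`'tanh'` and `'square'`) are hypothetical examples and are not part of the catalog used to produce the published dataset.

```python
import numpy as np

funcdict = {
    'power': np.power,
    'exp': np.exp,
    'reciprocal': np.reciprocal,
    'exp2': np.exp2,
}

# Hypothetical additions for illustration only (not in the original catalog).
funcdict['tanh'] = np.tanh
funcdict['square'] = np.square

# Each single-argument entry can be applied directly to a column of data.
x = np.array([0.5, 1.0, 2.0])
for name in ('tanh', 'square'):
    print(name, funcdict[name](x))
```

Single-argument functions like these need no other changes, because `apply_function` calls them through its final `else` branch. A key whose name starts with `log` would instead be routed through the branch that takes absolute values and adds a small epsilon first.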

In [26]:
'''
Randomly selects a non-linear function from the function dictionary above.
The selected function is later applied to a column of matrix C.

  Returns function_name : the dictionary key (name) of a numpy non-linear function
'''
def get_nonlinear_fn():
    functions = list(funcdict.keys())
    rand_int = np.random.randint(0, 100)
    function_index = rand_int % len(functions)
    function_name = functions[function_index]
    return function_name

In [27]:
'''
Applies the function named fn to one column of data.
An additional level of randomness is added for the power function,
where the exponent is also randomly selected and written to the log.
'''
def apply_function(col_data, fn, multiplier, fh):
#     print(f"Going to apply : {fn}")
    func = funcdict[fn]
    if fn == 'power':
        exponent = np.random.rand() * multiplier
        result = func(col_data, exponent)
        fh.write(f"\t The exponent for power function is {exponent}\n")
    elif fn.startswith("log"):
        epsilon = 0.001
        result = func(np.abs(col_data) + epsilon)
    else:
        result = func(col_data)
    return result

'''
This is the main routine that applies a non-linear transformation to each column of input matrix M.
The output matrix D has the same shape as M.  Additionally, the user can decide how many columns of
matrix M are used to generate each column of matrix D.  For example, if use_n_cols = 2, which
is what we used, then to generate the first column of D we randomly select 2 columns from M, apply
2 randomly selected non-linear transformations to these columns, and finally add the 2 columns to
generate a column of matrix D.  Each column of D is then standardized to zero mean and unit variance.

  M: input data matrix
  use_n_cols: how many columns from matrix M should we use
  fh: a file handle used to log exactly which transformations were used
'''
def apply_nonlinear_fn(M, use_n_cols, fh):
    functions = list(funcdict.keys())    
    n_cols = M.shape[1]
    n_functions = len(functions)
    iteration_dict = {}
    transformations = []
    D = None

    for col_index in range(n_cols):
        rand_col_index = random.sample(range(n_cols), use_n_cols)
        rand_fun_index = random.sample(range(n_functions), use_n_cols)
        new_col_data = []        
        fh.write(f"Column {col_index} is produced by\n")
        for i in range(use_n_cols):
            multiplier = np.random.randint(1, 5)
            fn = functions[rand_fun_index[i]]
            fh.write(f"\t Applying np.{fn} to Column {rand_col_index[i]}\n")
            if len(new_col_data) == 0:
                new_col_data = apply_function(M[:,rand_col_index[i]], fn, multiplier, fh)
            else:
                new_col_data = new_col_data + apply_function(M[:,rand_col_index[i]], fn, multiplier, fh)    
        fh.write(f"\t Finally, we add the {use_n_cols} columns to produce Column {col_index}\n")
        if D is None:
            D = new_col_data
        else:
            D = np.column_stack((D, new_col_data))
    D = (D - D.mean(axis=0))/np.std(D, axis=0)
    return D
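To make the per-column construction and the final standardization step concrete, here is a minimal sketch on a small stand-in matrix. The column indices, the functions, and the exponent are fixed by hand here purely for illustration; in `apply_nonlinear_fn` they are drawn at random, and the seed is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is an assumption, for reproducibility only
M = rng.random((100, 6))         # small stand-in for the 5000 x 784 matrix C

# Each output column is the sum of two transformed source columns
# (use_n_cols = 2). Indices and functions are hand-picked for this sketch.
col_0 = np.exp(M[:, 1]) + np.power(M[:, 4], 2.5)
col_1 = np.exp2(M[:, 3]) + np.reciprocal(M[:, 0])
D_raw = np.column_stack((col_0, col_1))

# Same standardization as the last step of apply_nonlinear_fn:
# subtract the column mean and divide by the column standard deviation.
D = (D_raw - D_raw.mean(axis=0)) / np.std(D_raw, axis=0)

print(D.mean(axis=0))  # approximately 0 for every column
print(D.std(axis=0))   # approximately 1 for every column
```

Standardizing puts all columns on a comparable scale, so that functions with a large output range, such as `np.reciprocal` on values close to zero, do not dominate the singular values of $\mathbf{D}$ simply because of their magnitude.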

To run the code, please uncomment it.  

In [28]:
def count_gt_threshold(z, threshold):
    tot = sum(z)
    z_pct = [(i/tot) for i in sorted(z, reverse=True)]
    z_gt_theta = [i for i in z_pct if i >= threshold]
    return len(z_gt_theta)
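The function above counts how many singular values $\sigma_i$ satisfy $\sigma_i / \sum_j \sigma_j \ge \theta$, that is, how many contribute at least a fraction $\theta$ of the total. The following worked example repeats the function so it runs on its own and applies it to a hand-made list of values; the list is illustrative and is not taken from the dataset.

```python
def count_gt_threshold(z, threshold):
    tot = sum(z)
    z_pct = [(i / tot) for i in sorted(z, reverse=True)]
    z_gt_theta = [i for i in z_pct if i >= threshold]
    return len(z_gt_theta)

# Illustrative values that sum to 100, so the shares are easy to read off:
# 0.50, 0.30, 0.15, 0.045 and 0.005.
singular_values = [50.0, 30.0, 15.0, 4.5, 0.5]

n_significant = count_gt_threshold(z=singular_values, threshold=0.01)
print(n_significant)  # 4: only the last value falls below the 1% threshold
```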

In [29]:
# a singular value that contributes at least 1% of the total is treated as signal, not noise
gte_threshold_ = 0.01

In [30]:
U, S, V = np.linalg.svd(C)
c_gte_dim = count_gt_threshold(z = S, threshold = gte_threshold_)

In [31]:
print(f"The linear dimension of original dataset C is {c_gte_dim}")

The linear dimension of original dataset C is 5


You <font color='red'>DON'T</font> need to run this; we provide it only for completeness.  If you decide to generate a non-linear dataset whose linear dimension differs substantially from the "5" shown above, you need to change two parameters: first, increase max_iterations, and second, change the linear-dimension threshold.  In this case we use 15.

The high-dimensional synthetic data $\mathbf{C}$ and its non-linear transformation $\mathbf{D}$ are available at https://github.com/nitishbahadur/book_chapter/tree/master/data/synthetic_data/input

The exact transformations that were used are listed in https://github.com/nitishbahadur/book_chapter/blob/master/data/synthetic_data/input/1586038282841_20.txt

In [32]:
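max_iterations = 5
os.makedirs(r'./debug', exist_ok=True)  # make sure the output directory exists
for i in range(max_iterations):
    print(f"Processing {i+1} of {max_iterations}")
    millis = int(round(time.time() * 1000))
    filepath = os.path.join(r'./debug', str(millis)+".txt")
    # log exactly which transformations were used for this candidate D
    with open(filepath, "w") as f:
        D = apply_nonlinear_fn(M=C, use_n_cols=2, fh=f)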
    U_d, S_d, V_d = np.linalg.svd(D)
    gte_threshold_= 0.01 # 1%
    gte_dim = count_gt_threshold(z = S_d, threshold = gte_threshold_)
    
    if gte_dim > 15:
        print("Dimension determined by >= {} metric is ==> {}".format(gte_threshold_, gte_dim))
        print()
                
        np.save(r'./debug/C_{}_dim_{}'.format(millis, c_gte_dim), C)
        np.save(r'./debug/D_{}_dim_{}'.format(millis, gte_dim), D)        
    else:
        print("not useful - dimension determined by >= {} metric is ==> {}".format(gte_threshold_, gte_dim))
        print()
        os.remove(filepath)

Processing 1 of 5
not useful - dimension determined by >= 0.01 metric is ==> 14

Processing 2 of 5
not useful - dimension determined by >= 0.01 metric is ==> 13

Processing 3 of 5
not useful - dimension determined by >= 0.01 metric is ==> 14

Processing 4 of 5
not useful - dimension determined by >= 0.01 metric is ==> 14

Processing 5 of 5
not useful - dimension determined by >= 0.01 metric is ==> 14

