This notebook is meant to supplement the analysis of areal_independence_verification.ipynb by instead using normally distributed matrices to simulate surface data.

While the surfaces generated by surface_simulation_generation.ipynb attempt to mimic the surfaces of cells/clusters, they are not always the best choice for testing whether a parameter is area independent. This is because those simulated surfaces are generally not self-similar: a quarter of the surface does not resemble the surface as a whole, so the same parameter can take different values on each.

Using normally distributed matrices ensures that the surfaces will be approximately self-similar. For an area-independent parameter, the value computed on the matrix and the value computed on any quarter of it will be similar. If the two values differ significantly, the parameter may not be area independent and should be investigated.

This approach fails for parameters that are extremely sensitive to noise (particularly S_rw). In these cases, using Igor's simulated surfaces to test for areal independence is preferable.
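
As a rough illustration of the self-similarity argument, the sketch below draws one normally distributed matrix and compares three simple statistics on the full matrix and on its upper-left quarter. The statistics (mean absolute deviation, RMS deviation, and peak-to-valley range) are generic stand-ins chosen for this example, not the definitions used by dr.extraction, and the seed, mean, and size are arbitrary assumptions.

```python
import numpy as np

def roughness_stats(surface):
    """Return (mean absolute deviation, RMS deviation, peak-to-valley range)."""
    centered = surface - surface.mean()
    return (np.abs(centered).mean(),
            np.sqrt((centered ** 2).mean()),
            centered.max() - centered.min())

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable
surface = rng.normal(loc=50, scale=10, size=(256, 256))
quarter = surface[:128, :128]   # upper-left quadrant

full_stats = np.array(roughness_stats(surface))
quarter_stats = np.array(roughness_stats(quarter))
ratios = full_stats / quarter_stats

for name, ratio in zip(("mean abs dev", "RMS dev", "range"), ratios):
    print(f"{name}: full/quarter = {ratio:.3f}")
```

The first two ratios should stay close to 1. The range ratio can never fall below 1, since the extremes of the full matrix include those of its quarter, which is one reason extreme-value parameters may drift away from 1 even on self-similar data.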

In [2]:
import os
import sys
from pathlib import Path

import numpy as np
import pandas as pd

sys.path.append('../')
from dr.extraction import parallelized_extraction

In [3]:
rg = np.random.default_rng()
# Zoom function
def zoom(arr, quadrant=1):
    """Slices a 2D array to return a particular quadrant of it.

    Args:
        arr (np.array): The MxN 2D array.
        quadrant (int): The quadrant of the image to extract. Quadrant 1
            maps to upper left, 2 to upper right, 3 to lower left, and
            any other value to lower right.
        
    Returns:
        np.array: A `M // 2` x `N // 2` quadrant of `arr`
    """
    M, N = arr.shape
    # Row slices pick the top/bottom half; column slices pick the left/right half
    if quadrant == 1:
        return arr[:M // 2, :N // 2]   # upper left
    elif quadrant == 2:
        return arr[:M // 2, N // 2:]   # upper right
    elif quadrant == 3:
        return arr[M // 2:, :N // 2]   # lower left
    else:
        return arr[M // 2:, N // 2:]   # lower right

def generate_surfaces(directory, base_name, n=100, dims=(256, 256), scale=1):
    """Generates normally-distributed matrices and stores their data under a directory.

    Args:
        directory (str): The directory under which to save the files.
        base_name (str): The base filename to save every generated surface under. The `i`th surface will be stored under `{directory}/{base_name}{i}.txt`.
        n (int): The number of surfaces to generate.
        dims ((int, int)): The dimensions that each generated surface should have.
        scale (int or float): A multiplier applied to every sample. Samples are drawn with a standard deviation of 10, so the saved surface has a standard deviation of `10 * scale`.

    Returns:
        None
    """
    # Create the directory (and any missing parents) if it does not exist yet
    os.makedirs(directory, exist_ok=True)

    DIR_PATH = Path(directory)
    for i in range(n):
        # The mean of the normal distribution does not affect the parameter
        # ratios, while the variance does. The mean is drawn at random here
        # for the sake of robustness, but its value does not matter.
        np.savetxt(DIR_PATH.joinpath(f'{base_name}{i}.txt'),
                   rg.normal(loc=rg.integers(100), scale=10, size=dims) * scale)

DIR = '../data/simulated2'
BASE_NAME = 'surface'
# Set N higher for more robust results; will take longer in turn
N = 10

# We can set the scale to something higher, but setting it too high will make it harder
# to see if parameters are area independent, since any differences in a parameter
# calculated on a surface vs on its quarter may be explained either by areal dependence
# or by high variance that makes the surface no longer self-similar
generate_surfaces(DIR, BASE_NAME, N, scale=1)

In [4]:
# Since these surfaces are generated, DELTA has no real significance here
# We set it to 5 / 256 out of convention
DELTA = 5 / 256

# We now extract the parameters from the surfaces we just generated
# DIR already starts with '../data/', so it is not prefixed again
fnames = [f'{DIR}/{BASE_NAME}{i}.txt' for i in range(N)]
df_full = parallelized_extraction(fnames, DELTA, DELTA, 256, 6)
df_quarter = parallelized_extraction(fnames, DELTA, DELTA, 256, 6, f=zoom)

100%|██████████| 10/10 [00:24<00:00,  2.41s/it]
100%|██████████| 10/10 [00:08<00:00,  1.18it/s]


In [5]:
df_full

Unnamed: 0,S_a,S_q,S_sk,S_ku,S_z,S_10z,S_v,S_p,S_mean,S_sc,...,S_td,S_tdi,S_rw,S_rwi,S_hw,S_fd,S_cl20,S_cl37,S_tr20,S_tr37
0,7.987769,10.014121,0.000872,3.01365,94.16228,249.27335,-45.315281,48.846999,81.051137,19474.618624,...,90.0,0.683106,4.980469,0.883694,0.079055,2.997483,0.013811,0.013811,1.425349,1.425349
1,8.00046,10.010727,0.001498,2.982361,86.474241,232.083527,-45.989092,40.485149,72.001079,19646.630062,...,0.0,0.70992,0.332031,0.8434,0.079055,2.994313,0.013811,0.013811,1.419749,1.419749
2,7.985578,10.022112,0.008088,3.017812,82.882861,243.177291,-40.081279,42.801582,26.048708,19724.515596,...,90.0,0.668941,4.980469,0.730906,0.08033,2.975441,0.013811,0.013811,1.414214,1.414214
3,7.939501,9.94835,-0.008604,2.98697,86.356267,231.555205,-45.520732,40.835535,29.987709,19153.210855,...,90.0,0.692876,1.660156,0.82799,0.079055,2.991896,0.013811,0.013811,1.414214,1.414214
4,7.950961,9.985544,-0.00847,3.030082,90.007142,251.727934,-45.069957,44.937185,90.023438,19562.749291,...,0.0,0.668446,2.490234,0.753131,0.079055,2.988963,0.013811,0.013811,1.419749,1.419749
5,7.974652,9.992025,-0.003063,2.986289,88.799528,244.355025,-42.151824,46.647703,87.968677,19594.46609,...,90.0,0.715242,0.830078,0.874066,0.07782,3.004495,0.013811,0.013811,1.419749,1.419749
6,7.966731,9.982856,0.019587,2.991123,86.475615,241.818582,-38.774739,47.700876,81.985755,19558.333818,...,90.0,0.697427,0.996094,0.798564,0.07782,3.004604,0.013811,0.013811,2.007828,2.007828
7,7.915207,9.939797,-0.020804,3.028661,87.940348,251.974264,-40.749459,47.19089,2.953779,19154.828397,...,90.0,0.663592,0.249023,0.853891,0.07782,3.007136,0.013811,0.013811,1.414214,1.414214
8,7.986206,10.006543,0.01673,3.018138,89.676438,238.765574,-49.337298,40.33914,61.007936,19496.98233,...,0.0,0.698183,2.490234,0.7401,0.07782,3.011549,0.013811,0.013811,1.414214,1.414214
9,7.956541,9.984039,0.016763,3.026769,89.321117,241.970524,-37.03224,52.288877,18.996215,19452.310104,...,0.0,0.734319,1.245117,0.8804,0.07782,3.010673,0.013811,0.013811,1.425349,1.425349


In [6]:
df_quarter

Unnamed: 0,S_a,S_q,S_sk,S_ku,S_z,S_10z,S_v,S_p,S_mean,S_sc,...,S_td,S_tdi,S_rw,S_rwi,S_hw,S_fd,S_cl20,S_cl37,S_tr20,S_tr37
0,8.052431,10.097057,-0.012579,2.998292,80.384742,219.489677,-40.638213,39.746529,81.095501,19682.119503,...,0.0,0.658107,1.668917,0.84035,0.076909,3.00893,0.013811,0.013811,1.414214,1.414214
1,8.035832,10.055471,0.007296,2.961787,82.607239,213.015654,-45.828878,36.77836,71.958572,20040.930274,...,90.0,0.730196,1.252514,0.744817,0.07811,2.979506,0.013811,0.013811,1.43648,1.43648
2,7.988225,10.020847,-0.008853,3.010937,81.352555,220.171156,-40.232273,41.120282,26.035688,19947.659019,...,90.0,0.701209,0.346017,0.774136,0.076909,3.009268,0.013811,0.013811,1.436661,1.436661
3,7.921281,9.91499,0.00723,2.985946,75.6038,216.288205,-38.448556,37.155244,30.059555,19159.203481,...,90.0,0.684283,0.323705,0.780923,0.07811,3.013248,0.013811,0.013811,1.43648,1.43648
4,7.930986,9.976017,-0.016159,3.057046,74.988204,216.681154,-36.822984,38.16522,89.947854,19586.681168,...,90.0,0.705887,0.771365,0.817552,0.076909,3.017063,0.013811,0.013811,1.43648,1.43648
5,7.950458,9.94433,-0.008137,2.980239,80.370344,220.223426,-39.895883,40.474461,88.020244,19497.564408,...,90.0,0.706614,1.668917,0.707263,0.076909,3.00379,0.013811,0.013811,1.414214,1.414214
6,8.029597,10.02754,0.011201,2.967697,77.64132,226.381854,-36.147491,41.49383,81.933881,19664.83385,...,90.0,0.699034,0.456034,0.854711,0.075744,3.016078,0.013811,0.013811,1.43648,1.43648
7,7.955789,9.946858,-0.032527,2.999611,86.234066,218.614866,-39.046103,47.187963,2.915694,18996.066671,...,90.0,0.696914,0.334489,0.867843,0.076909,3.008647,0.013811,0.013811,1.425305,1.425305
8,7.916124,9.928802,0.039934,3.069512,78.560952,231.761408,-38.303679,40.257273,61.05876,19190.259239,...,90.0,0.673309,0.43622,0.804086,0.07811,3.008614,0.013811,0.013811,1.425305,1.425305
9,7.954444,9.966339,0.024959,2.996875,76.531353,222.606845,-36.721961,39.809392,18.896053,19383.85693,...,90.0,0.683884,0.590037,0.837533,0.076909,3.011924,0.013811,0.013811,2.015686,2.015686


The ratios below are how we judge whether a parameter is area independent. If the mean ratio is close to 1, the parameter takes roughly the same value on a surface and on its quarter, and we can claim that it is area independent. If not, the parameter may not be area independent, but you should also check the standard deviation of the ratios. A low standard deviation suggests that the two values differ by a roughly constant factor. A high standard deviation suggests that you are dealing with some sort of noise, in which case you should verify the parameter's areal independence some other way.

Note that some parameters are intrinsically not area independent, like S_2a and S_3a.
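
The check described above can be sketched as a small helper that reports the mean and standard deviation of the per-surface ratios for each parameter. The function name `summarize_ratios`, the `tol` threshold, and the toy values are all hypothetical and only illustrate the idea; they are not extraction output.

```python
import pandas as pd

def summarize_ratios(full, quarter, tol=0.05):
    """Summarize per-parameter full/quarter ratios.

    `tol` is an arbitrary illustrative threshold for "close to 1".
    """
    ratios = full / quarter
    summary = pd.DataFrame({"mean": ratios.mean(), "std": ratios.std()})
    summary["near_one"] = (summary["mean"] - 1).abs() <= tol
    return summary

# Toy values for two parameters on three surfaces (made up for this sketch)
toy_full = pd.DataFrame({"S_q": [10.0, 10.1, 9.9], "S_2a": [25.0, 25.0, 25.0]})
toy_quarter = pd.DataFrame({"S_q": [10.1, 10.0, 9.9], "S_2a": [6.25, 6.25, 6.25]})

summary = summarize_ratios(toy_full, toy_quarter)
print(summary)
```

In this toy case the S_q ratio averages close to 1, while the S_2a ratio is a constant 4 with zero spread, which is the "constant factor" situation rather than noise.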

In [7]:
(df_full / df_quarter).mean()[:100]

S_a           0.999129
S_q           1.000102
S_sk          0.241171
S_ku          1.001872
S_z           1.112658
S_10z         1.100960
S_v           1.099209
S_p           1.126482
S_mean        1.001657
S_sc          0.998449
S_2a          4.000000
S_3a          4.041402
S_dr          1.010351
S_dq          0.999987
S_dq6         1.000001
S_bi          1.001623
S_ci          0.998517
S_vi          1.026076
S_pk          0.999822
S_vk          1.007157
S_k           0.999230
S_dc0-5       1.217637
S_dc5-10      0.995339
S_dc10-50     1.003842
S_dc50-95     0.999437
S_dc50-100    1.098078
S_ds          1.022777
S_td               inf
S_tdi         0.999651
S_rw          3.724501
S_rwi         1.023525
S_hw          1.018368
S_fd          0.997001
S_cl20        1.000000
S_cl37        1.000000
S_tr20        1.004667
S_tr37        1.004667
dtype: float64
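
The inf reported for S_td comes from surfaces where the quarter's value is 0.0 while the full surface's value is not, so the ratio divides by zero and a single such row makes the mean infinite. One possible way to keep those rows from dominating, sketched here on toy values rather than the extraction output, is to mask infinite ratios before averaging and to report how many were dropped. Whether a ratio is meaningful at all for a parameter like this is a separate question.

```python
import numpy as np
import pandas as pd

# Toy S_td-like values (made up): a 0.0 in the quarter gives an infinite ratio
toy_full = pd.Series([90.0, 0.0, 90.0])
toy_quarter = pd.Series([0.0, 90.0, 90.0])

ratios = toy_full / toy_quarter
naive_mean = ratios.mean()  # infinite because of the first row

# Treat infinite ratios as missing so that mean() skips them
finite = ratios.replace([np.inf, -np.inf], np.nan)
masked_mean = finite.mean()
n_dropped = int(finite.isna().sum())

print(f"naive mean: {naive_mean}, masked mean: {masked_mean}, dropped: {n_dropped}")
```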