In [1]:
import pandas as pd
import numpy as np
from xray_stats import load_process as lp
from xray_stats import df_plotting as dp
from IPython.display import display

## Batch Calculate Soil Statistics
**Summary:** In this Jupyter notebook, we calculate local and coarse statistics for all x-ray datasets and save them to a CSV file for subsequent analysis and processing. These metrics include per-horizontal-slice calculations of the **mean, median, standard deviation, 5th percentile and 95th percentile** of **sliding-window skewness, sliding-window kurtosis, sliding-window variance, sobel edges, pixel intensity and density**. See the load_process library (in particular the **get_stats** function and the functions it calls) and the previous notebooks for an overview of these calculations.

First we introduce the function that calculates these metrics, show example usage, and explain the choice of parameters used for the precomputed dataset that later analyses rely on. Then we provide code that compiles statistics for all horizontal slices in a scan, as well as by binned depths.
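
To make the per-slice summaries concrete, here is a minimal sketch of one of them: sliding-window variance on a single 2D slice, reduced to the five summary values. It is not the `get_stats` implementation. The helper name `windowed_variance_summary` is hypothetical, the slice is synthetic, and the sketch assumes that the skip value acts as the stride between window positions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def windowed_variance_summary(image, window, skip):
    """Summarise sliding-window variance for one 2D slice.

    Hypothetical helper for illustration; not part of xray_stats.
    """
    # View every window x window patch, then keep every `skip`-th row and column.
    patches = sliding_window_view(image, (window, window))[::skip, ::skip]
    # Variance inside each retained patch, flattened to one value per window.
    local_var = patches.var(axis=(-2, -1)).ravel()
    return {
        "vari_mean": float(local_var.mean()),
        "vari_median": float(np.median(local_var)),
        "vari_std": float(local_var.std()),
        "vari_p5": float(np.percentile(local_var, 5)),
        "vari_p95": float(np.percentile(local_var, 95)),
    }


rng = np.random.default_rng(0)
fake_slice = rng.normal(size=(200, 200))  # synthetic stand-in for one tiff slice
summary = windowed_variance_summary(fake_slice, window=50, skip=10)
print(summary)
```

The dictionary keys mirror the `vari_*` column names in the saved CSV. The skewness, kurtosis and edge metrics would follow the same pattern with a different per-window statistic.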


### Compute per slice metrics

Here we provide code for compiling statistics into a dataframe and saving to a csv file. The full dataset has already been computed and provided in ../data/precomputed_soil_stats.csv. In the following cell we provide an example for computing a subset along with the settings and parameters needed for computation.

In [16]:
stacks = [10] # List of xray scan indices to compute between 1 and 54 corresponding to "Sample ID" in the meta.csv.
winds = [50,150] # List of square window sizes in pixels to use for sliding window computations
denoises = [True] # Denoising is always important for reducing noise that may affect sliding window and sobel computations
tiffskips = 50 # Number of horizontal slices to skip to reduce dataset and computation time.
pixelskips = 10 # The number of rows and columns to skip between sliding window calculations.

datas = []
for stack in stacks:
    # Build the array of tiff indices to sample.
    total_images, tiff_files_sorted, path_to_tiff_stack = lp.get_tiff_stack(stack)
    tiffs = np.arange(0, total_images, tiffskips)
    for tiff in tiffs:
        for wind in winds:
            for denoise in denoises:
                lp.print_and_flush(f"Stack: {stack}, Tiff: {tiff}, Window Size: {wind}, Denoise: {denoise}")
                try:
                    datas.append(lp.get_stats(tiff,stack,wind,pixelskips,denoise,False,tiff_files_sorted, path_to_tiff_stack))
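                except Exception as err:
                    # Report which slice failed and why, then continue with the rest
                    print(f"Skipping stack {stack}, tiff {tiff}, window {wind}: {err}")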
datas = pd.concat(datas, axis=0)
datas = datas.reset_index(drop=True)
datas.to_csv("../data/sample_subset_stats.csv")
print(datas.describe())

       stack_index  tiff_index  block  sub-rep  window_size   skip  \
count        136.0   136.00000  136.0    136.0   136.000000  136.0   
mean          10.0  1675.00000    2.0      1.0   100.000000   10.0   
std            0.0   985.01739    0.0      0.0    50.184844    0.0   
min           10.0     0.00000    2.0      1.0    50.000000   10.0   
25%           10.0   837.50000    2.0      1.0    50.000000   10.0   
50%           10.0  1675.00000    2.0      1.0   100.000000   10.0   
75%           10.0  2512.50000    2.0      1.0   150.000000   10.0   
max           10.0  3350.00000    2.0      1.0   150.000000   10.0   

        skew_mean  skew_median    skew_std     skew_p5  ...   img_median  \
count  136.000000   136.000000  136.000000  136.000000  ...   136.000000   
mean    -0.993965    -1.047691    1.431102   -3.085000  ...  6698.112472   
std      0.564904     0.574676    0.606970    1.015630  ...   605.882269   
min     -2.335720    -2.419290    0.447662   -5.136869  ...  4569

**NOTE ON SCAN (STACK) 30:** The images from this scan are missing the container and outside-air pixels, so it must be excluded from per-treatment statistical analyses of density. It can still be used for sliding-window soil heterogeneity statistics.

### Selecting reasonable skip values
Skip parameters for both pixels and tiffs were selected to reduce calculation time without substantially affecting the statistics. Below are analyses for different skip values. After calculating statistics, we also use another module included in the library: **df_plotting**. This module takes a pandas dataframe, the columns to use for data filtering, and the default columns for plotting, and produces an interactive GUI for 2D and 3D scatter plots as well as boxplots. This is useful for the visualization side of exploratory data analysis (EDA).

#### Varying Pixel Skips:

In [4]:
## Varying Pixel Skips - Only affects statistics of sliding window metrics

stacks = [23] # List of xray scan indices to compute between 1 and 54 corresponding to "Sample ID" in the meta.csv.
winds = [50] # List of square window sizes in pixels to use for sliding window computations
denoises = [True] # Denoising is always important for reducing noise that may affect sliding window and sobel computations
tiffskips = 500 # Number of horizontal slices to skip to reduce dataset and computation time.
pixelskips = [5, 10, 20, 50, 100] # The number of rows and columns to skip between sliding window calculations.

datas = []
for stack in stacks:
    # Build the array of tiff indices to sample.
    total_images, tiff_files_sorted, path_to_tiff_stack = lp.get_tiff_stack(stack)
    tiffs = np.arange(0, total_images, tiffskips)
    for tiff in tiffs:
        for wind in winds:
            for pixelskip in pixelskips:
                lp.print_and_flush(f"Stack: {stack}, Tiff: {tiff}, Window Size: {wind}, Pixel Skip: {pixelskip}    ")
                try:
                    datas.append(lp.get_stats(tiff,stack,wind,pixelskip,denoises[0],False,tiff_files_sorted, path_to_tiff_stack))
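                except Exception as err:
                    # Report which slice failed and why, then continue with the rest
                    print(f"Skipping stack {stack}, tiff {tiff}, pixel skip {pixelskip}: {err}")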
datas = pd.concat(datas, axis=0)
datas = datas.reset_index(drop=True)


filter_cols = ["stack_index","window_size","tillage","fertilizer","tillage-fertilizer","denoise"]
default_cols = ["skip","kurt_mean","edge_mean","depth","stack_index","tillage"]
gui = dp.build_gui(datas,filter_cols,default_cols)
display(gui)

Stack: 23, Tiff: 3000, Window Size: 50, Pixel Skip: 100    

[Interactive df_plotting GUI widget; not rendered in this export]

**Results of varying how many pixels (image rows/columns) to skip when collecting sliding window calculations:** An analysis of one scan indicates that pixel skip values above approximately 20 begin to affect the resulting average sliding window metrics. This is particularly true for a sliding window size of 50x50 pixels, and the effect is reduced for larger window sizes. Because a window size of 50 yields useful insights, we use a pixel skip value of 10 to be safe.
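
One way to see why large skips start to matter is to count how many window positions remain per slice. The sketch below is only arithmetic, assuming the window start moves `skip` pixels at a time and using a hypothetical 1000 x 1000 pixel slice (the actual scan dimensions are not stated here). The helper name `windows_per_slice` is made up for illustration.

```python
def windows_per_slice(height, width, window, skip):
    """Count window positions when the window start moves `skip` pixels at a time."""
    rows = (height - window) // skip + 1
    cols = (width - window) // skip + 1
    return rows * cols


# Hypothetical 1000 x 1000 pixel slice; real scan dimensions may differ.
for skip in [5, 10, 20, 50, 100]:
    print(skip, windows_per_slice(1000, 1000, window=50, skip=skip))
```

The number of windows falls roughly with the square of the skip value, so the per-slice averages at large skips rest on far fewer samples.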

#### Varying Tiff Skips:

In [9]:
## Varying Tiff Skips - Affects statistics of all metrics

stacks = [23] # List of xray scan indices to compute between 1 and 54 corresponding to "Sample ID" in the meta.csv.
winds = [50] # List of square window sizes in pixels to use for sliding window computations
denoises = [True] # Denoising is always important for reducing noise that may affect sliding window and sobel computations
tiffskips = [10,50,100,300] # Number of horizontal slices to skip to reduce dataset and computation time.
pixelskips = [10] # The number of rows and columns to skip between sliding window calculations.

guis = []
for tiffskip in tiffskips:
    datas = []
    for stack in stacks:
        # Build the array of tiff indices to sample.
        total_images, tiff_files_sorted, path_to_tiff_stack = lp.get_tiff_stack(stack)
        tiffs = np.arange(0, total_images, tiffskip)
        for tiff in tiffs:
            for wind in winds:
                for pixelskip in pixelskips:
                    lp.print_and_flush(f"Stack: {stack}, Tiff: {tiff}, Window Size: {wind}, Tiff Skip: {tiffskip}    ")
                    try:
                        datas.append(lp.get_stats(tiff,stack,wind,pixelskip,denoises[0],False,tiff_files_sorted, path_to_tiff_stack))
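                    except Exception as err:
                        # Report which slice failed and why, then continue with the rest
                        print(f"Skipping stack {stack}, tiff {tiff}, tiff skip {tiffskip}: {err}")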
    datas = pd.concat(datas, axis=0)
    datas = datas.reset_index(drop=True)
    filter_cols = ["stack_index","window_size","tillage","fertilizer","tillage-fertilizer","denoise"]
    default_cols = ["skip","kurt_mean","edge_mean","tiff_index","stack_index","tillage"]
    gui = dp.build_gui(datas,filter_cols,default_cols)
    print(f"Using every {tiffskip} tiffs (horizontal x-ray slices)")
    display(gui)

Using every 10 tiffs (horizontal x-ray slices)

[Interactive df_plotting GUI widget; not rendered in this export]

Using every 50 tiffs (horizontal x-ray slices)

[Interactive df_plotting GUI widget; not rendered in this export]

Using every 100 tiffs (horizontal x-ray slices)

[Interactive df_plotting GUI widget; not rendered in this export]

Using every 300 tiffs (horizontal x-ray slices)

[Interactive df_plotting GUI widget; not rendered in this export]

**Results:** Very sparse sampling had minimal impact on metrics averaged across the entire depth, since slices are still drawn from the full depth range. A tiff skip value of at most 50 is nevertheless useful for generating reliable statistics within binned depths.
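
The binned-depth concern can be illustrated by counting how many sampled slices land in each of four bins for the tiff skip values tested above. This is a rough sketch, not the notebook's binning: it bins by slice index as a stand-in for depth, and it assumes a hypothetical stack of 3400 slices. The helper name `slices_per_depth_bin` is made up for illustration.

```python
def slices_per_depth_bin(total_images, tiffskip, n_bins=4):
    """Count sampled slices falling in each of n_bins equal-width index ranges."""
    counts = [0] * n_bins
    for tiff in range(0, total_images, tiffskip):
        # Map the slice index to a bin; indices stand in for depth here.
        counts[min(tiff * n_bins // total_images, n_bins - 1)] += 1
    return counts


# Hypothetical stack of 3400 slices; actual stacks vary in length.
for tiffskip in [10, 50, 100, 300]:
    print(tiffskip, slices_per_depth_bin(3400, tiffskip))
```

With a skip of 300, each bin is summarised from only a handful of slices, which is why a smaller skip is preferable once statistics are computed per depth bin.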

### Parameters for full dataset:
After some preliminary testing, below are the parameters used to generate statistics on the entire dataset. We collect data for several sliding window sizes, although 50x50 alone produced useful results. This cell has already been run, and the statistics are saved as "../data/precomputed_soil_stats.csv".

In [13]:
stacks = list(range(1, 55))  # Scan indices 1 through 54
winds = [50,100,150]
denoises = [True]
tiffskips = 50
pixelskips = 10

datas = []
for stack in stacks:
    # Build the array of tiff indices to sample.
    total_images, tiff_files_sorted, path_to_tiff_stack = lp.get_tiff_stack(stack)
    tiffs = np.arange(0, total_images, tiffskips)
    for tiff in tiffs:
        for wind in winds:
            for denoise in denoises:
                lp.print_and_flush(f"Stack: {stack}, Tiff: {tiff}, Window Size: {wind}, Denoise: {denoise}")
                try:
                    datas.append(lp.get_stats(tiff,stack,wind,pixelskips,denoise,False,tiff_files_sorted, path_to_tiff_stack))
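                except Exception as err:
                    # Report which slice failed and why, then continue with the rest
                    print(f"Skipping stack {stack}, tiff {tiff}, window {wind}: {err}")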
datas = pd.concat(datas, axis=0)
datas = datas.reset_index(drop=True)
datas.to_csv('../data/full_soil_stats.csv')

Stack: 30, Tiff: 0, Window Size: 100, Denoise: True


divide by zero encountered in double_scalars





Stack: 54, Tiff: 3300, Window Size: 150, Denoise: True

With the full dataset now batch calculated and saved, the next two cells generate bulk statistics both by scan and by scan with depth bins. In these cells, we assume that there are **multiple *window sizes*, only one *pixel skip* value, and *denoise* always True**. If this changes, you must modify these cells to compute bulk statistics accurately. These two cells have already been run, with the CSVs saved as "../data/precomputed_soil_stats_depth_binned.csv" and "../data/precomputed_soil_stats_compiled.csv" (the latter uses a single bin encompassing the entire depth range). In the next notebook we will begin to explore this data.
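
Since the bulk-statistics cells depend on these assumptions, a small guard can confirm them before aggregating. The sketch below is an optional addition, not part of the original workflow. The function name `check_bulk_assumptions` is hypothetical and the table is synthetic. It uses the `skip`, `denoise` and `window_size` column names from the per-slice CSV.

```python
import pandas as pd


def check_bulk_assumptions(df):
    """Raise if the per-slice table breaks the assumptions stated above."""
    if df["skip"].nunique() != 1:
        raise ValueError("expected exactly one pixel skip value")
    if not df["denoise"].all():
        raise ValueError("expected denoise to be True for every row")
    # Return the window sizes present so the caller can see what will be grouped.
    return sorted(df["window_size"].unique().tolist())


# Tiny synthetic table using the column names from the per-slice CSV.
example = pd.DataFrame({
    "stack_index": [1, 1, 1],
    "window_size": [50, 100, 150],
    "skip": [10, 10, 10],
    "denoise": [True, True, True],
})
print(check_bulk_assumptions(example))
```

Calling the same function on the dataframe read from the per-slice CSV, before the loops below, would flag a changed parameter set early.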

In [14]:
### Calculate Bulk Statistics for each x-ray soil core scan across 4 depth bins.

original_df = pd.read_csv('../data/precomputed_soil_stats.csv')
new_df = []

# Drop rows with missing values in the 'depth' column
original_df = original_df.dropna(subset=['depth'])
depth_bins = pd.cut(original_df['depth'], bins=4)
for stack_index, window_size in original_df[['stack_index', 'window_size']].drop_duplicates().itertuples(index=False):
    for depths in depth_bins.drop_duplicates():
        subset_df = original_df[(original_df['stack_index'] == stack_index) & (original_df['window_size'] == window_size) & (depth_bins == depths)]
        # Extract the unique values from the subset_df (assuming they are the same within each subset)
        row = subset_df[['stack_index', 'file_name', 'tillage', 'fertilizer', 'tillage-fertilizer',
                                       'block', 'sub-rep', 'window_size', 'skip', 'denoise']]
        # Copy the first row so the column assignments below do not trigger a SettingWithCopyWarning
        row = row.head(1).copy()
        # Calculate mean, median, and std of specified columns
        for column in ['skew_mean', 'skew_median', 'skew_std', 'skew_p5',
                       'skew_p95', 'kurt_mean', 'kurt_median', 'kurt_std', 'kurt_p5',
                       'kurt_p95', 'vari_mean', 'vari_median', 'vari_std', 'vari_p5',
                       'vari_p95', 'edge_mean', 'edge_median', 'edge_std', 'edge_p5',
                       'edge_p95', 'img_mean', 'img_median', 'img_std', 'img_p5', 'img_p95',
                       'img_mean_norm (g/cm3)', 'img_median_norm (g/cm3)',
                       'img_std_norm (g/cm3)', 'img_p5_norm (g/cm3)', 'img_p95_norm (g/cm3)']:
            row[f'{column}_mean'] = subset_df[f'{column}'].mean()
            row[f'{column}_median'] = subset_df[f'{column}'].median()
            row[f'{column}_std'] = subset_df[f'{column}'].std()
    
        # Calculate mean, max, and min of depth and tiff_index
        row['mean_depth'] = subset_df['depth'].mean()
        row['max_depth'] = subset_df['depth'].max()
        row['min_depth'] = subset_df['depth'].min()
        row['mean_tiff_index'] = subset_df['tiff_index'].mean()
        row['max_tiff_index'] = subset_df['tiff_index'].max()
        row['min_tiff_index'] = subset_df['tiff_index'].min()

        # Append the row to the new_df
        new_df.append(pd.DataFrame(row))

new_df = pd.concat(new_df,axis=0)
new_df = new_df.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
# Export the new_df to CSV
new_df.to_csv('../data/full_soil_stats_depth_binned.csv', index=False)

In [15]:
### Calculate Bulk Statistics for each x-ray soil core scan across 1 single depth bin encompassing all depths.

original_df = pd.read_csv('../data/precomputed_soil_stats.csv')
new_df = []

# Drop rows with missing values in the 'depth' column
original_df = original_df.dropna(subset=['depth'])
#depth_bins = pd.cut(original_df['depth'], bins=4)
for stack_index, window_size in original_df[['stack_index', 'window_size']].drop_duplicates().itertuples(index=False):
    subset_df = original_df[(original_df['stack_index'] == stack_index) & (original_df['window_size'] == window_size)]
    # Extract the unique values from the subset_df (assuming they are the same within each subset)
    row = subset_df[['stack_index', 'file_name', 'tillage', 'fertilizer', 'tillage-fertilizer',
                                   'block', 'sub-rep', 'window_size', 'skip', 'denoise']]
    # Copy the first row so the column assignments below do not trigger a SettingWithCopyWarning
    row = row.head(1).copy()
    # Calculate mean, median, and std of specified columns
    for column in ['skew_mean', 'skew_median', 'skew_std', 'skew_p5',
                   'skew_p95', 'kurt_mean', 'kurt_median', 'kurt_std', 'kurt_p5',
                   'kurt_p95', 'vari_mean', 'vari_median', 'vari_std', 'vari_p5',
                   'vari_p95', 'edge_mean', 'edge_median', 'edge_std', 'edge_p5',
                   'edge_p95', 'img_mean', 'img_median', 'img_std', 'img_p5', 'img_p95',
                   'img_mean_norm (g/cm3)', 'img_median_norm (g/cm3)',
                   'img_std_norm (g/cm3)', 'img_p5_norm (g/cm3)', 'img_p95_norm (g/cm3)']:
        row[f'{column}_mean'] = subset_df[f'{column}'].mean()
        row[f'{column}_median'] = subset_df[f'{column}'].median()
        row[f'{column}_std'] = subset_df[f'{column}'].std()

    # Calculate mean, max, and min of depth and tiff_index
    row['mean_depth'] = subset_df['depth'].mean()
    row['max_depth'] = subset_df['depth'].max()
    row['min_depth'] = subset_df['depth'].min()
    row['mean_tiff_index'] = subset_df['tiff_index'].mean()
    row['max_tiff_index'] = subset_df['tiff_index'].max()
    row['min_tiff_index'] = subset_df['tiff_index'].min()

    # Append the row to the new_df
    new_df.append(pd.DataFrame(row))

new_df = pd.concat(new_df,axis=0)
new_df = new_df.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
# Export the new_df to CSV
new_df.to_csv('../data/full_soil_stats_compiled.csv', index=False)