In [1]:
## Fundamental Libraries
import pandas as pd
import numpy as np
from xray_stats import load_process as lp
from xray_stats import df_plotting as dp
from IPython.display import display
import plotly.express as px
import time

## ML Libraries
from sklearn.feature_selection import mutual_info_classif
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

## Data analysis of statistical soil metrics
In this notebook, we begin analyzing the soil metrics computed in the previous notebook. We use the provided df_plotting library, which takes a pandas DataFrame and generates an interactive plotting widget for exploring the data. The widget lets the user filter and plot by different DataFrame columns, with 2D and 3D scatter plots as well as box plots available.

The goal was to understand how these metrics were influenced by soil treatment. Below are three cells, one to plot metrics for individual horizontal slices, one to plot metrics averaged across all computed slices of a soil core, and one to plot metrics averaged by 4 depth bins.

Keep in mind that these metrics were calculated for three different sliding window sizes (as per the provided precomputed CSV files and their settings described in the previous notebook), so make sure to filter by window size (either 50, 100 or 150).

In [2]:
## Soil metrics for each computed horizontal slice
df = pd.read_csv('../data/precomputed_soil_stats.csv')
filter_cols = ["stack_index","tiff_index","window_size","tillage","fertilizer","tillage-fertilizer","depth"]
default_cols = ["skew_mean","kurt_mean","edge_mean","tillage-fertilizer","tiff_index","stack_index"]
gui = dp.build_gui(df,filter_cols,default_cols)
display(gui)


In [6]:
## Soil metrics averaged across all slices (depths) of each soil core x-rayed
df = pd.read_csv('../data/precomputed_soil_stats_compiled.csv')
filter_cols = ["stack_index","window_size","tillage","fertilizer","tillage-fertilizer"]
default_cols = ["skew_mean_mean","kurt_mean_mean","edge_mean_mean","tillage-fertilizer","stack_index","tillage"]
gui = dp.build_gui(df,filter_cols,default_cols)
display(gui)


In [3]:
## Soil metrics averaged within 4 depth bins for each soil core x-rayed
df = pd.read_csv('../data/precomputed_soil_stats_depth_binned.csv')
filter_cols = ["stack_index","tiff_index","window_size","tillage","fertilizer","tillage-fertilizer","min_depth","min_tiff_index"]
default_cols = ["skew_mean_mean","kurt_mean_mean","edge_mean_mean","tillage-fertilizer","min_depth","stack_index"]
gui = dp.build_gui(df,filter_cols,default_cols)
display(gui)


**Streamlit version:** These metrics (individual horizontal slices and depth-binned stats) are also viewable through a Streamlit app. To run the app, either run the command below in a Jupyter cell, run the same command (without the leading !) in a terminal (be sure to cd to the top-level directory of this analysis package first), or visit www... If running locally, make sure Streamlit is installed (pip install streamlit).

In [None]:
!streamlit run ../streamlit/full_data_heterogeneity_vis_1.py

### Initial Observations
When plotting by **kurtosis, skewness and sobel edges**, there are some striking visual differences between the different treatments. Native prairie is particularly distinct from the other treatments in this space, and high manure vs high fertilizer shows clear visual separation. Additionally, most data seems to fall on a curved surface, which is most evident when plotting all horizontal slices.

*To get an intuition of what different regions of this 3D space look like, I wrote a script that interactively lets you select individual plotted points of horizontal slices and see the x-ray image of that slice. These select slices have been predecimated and saved as a part of this analysis package (so no need to access the entire raw x-ray dataset). This code is provided in the **N4b_Soil_Metrics_vs_XRay_Images notebook**.*

## Influential Metrics and Feature Selection
In this section, our objective is to identify the statistical metrics that most effectively differentiate soil treatments, including tillage and fertilizer applications. To achieve this, we perform feature engineering and prioritize the most informative metrics. Specifically, we focus on three key metrics: average kurtosis, skewness, and sobel edges, which exhibit noticeable visual differences among treatments.

### Feature Engineering
We begin by evaluating the impact of these three metrics on distinguishing between treatments. While there are numerous available metrics (which can be explored visually using the provided plotting tools), we use mutual information to quantify the relationship between these metrics and the treatments.

**Mutual Information:** Mutual information measures the degree of information shared between two variables. In our context, it helps us understand how informative each metric is in classifying soil treatments. Higher mutual information values indicate stronger discrimination potential.
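
As a rough illustration of how mutual information scores behave, the sketch below scores two synthetic features against a three-class target. The data, the variable names and the noise levels are assumptions made up for this example; only `mutual_info_classif` itself is taken from the notebook.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for a treatment label with three classes (not the soil data)
rng = np.random.default_rng(0)
treatment = np.repeat([0, 1, 2], 60)

# One feature that tracks the class, and one that is pure noise
informative = treatment + rng.normal(scale=0.3, size=treatment.size)
noise = rng.normal(size=treatment.size)
X_demo = np.column_stack([informative, noise])

# Scores are non-negative; larger means the feature tells us more about the class
mi = mutual_info_classif(X_demo, treatment, random_state=0)
print(dict(zip(["informative", "noise"], mi)))
```

The informative feature should score well above the noise feature, whose score should sit close to zero. Note that the estimator is randomized, so scores can shift slightly between runs unless `random_state` is fixed.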

### Data Preparation
To apply mutual information, we need to structure our data appropriately. For this exercise, we choose the depth-bin-averaged dataset as the basis. We extract all available metrics, creating a feature for each combination of metric and depth (rather than using depth bin as a feature itself). This allows us to capture the nuances of how metrics vary with depth, which can be essential for accurate classification.
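
The reshaping described above can be sketched on a toy table. The values below are invented for illustration; the column names `stack_index`, `min_depth`, `skew_mean_mean` and `kurt_mean_mean` follow the depth-binned CSV used in this notebook.

```python
import pandas as pd

# Toy long-format table: one row per (scan, depth bin), values are made up
long_df = pd.DataFrame({
    "stack_index": [1, 1, 2, 2],
    "min_depth": [0.0, 3.51, 0.0, 3.51],
    "skew_mean_mean": [0.10, 0.20, 0.30, 0.40],
    "kurt_mean_mean": [3.0, 3.1, 2.8, 2.9],
})

# Pivot so each scan becomes one row and each metric x depth pair becomes a column
wide_df = long_df.pivot(index="stack_index", columns="min_depth")

# Flatten the (metric, depth) MultiIndex into names like 'skew_mean_mean_3.51'
wide_df.columns = [f"{metric}_{depth}" for metric, depth in wide_df.columns]
print(wide_df)
```

Each scan then contributes a single sample to the classifier, with depth encoded in the feature names rather than as a feature of its own.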

In [115]:
def get_features(i):
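    if i in [0, 1, 2, 3]:
        # Depth Binned Averages
        df = pd.read_csv('../data/precomputed_soil_stats_depth_binned.csv')
    elif i in [4, 5]:
        # Individual Depths (Horizontal X-Ray Slices)
        df = pd.read_csv('../data/precomputed_soil_stats.csv')
    else:
        # Fail early: otherwise df is undefined below and the final ValueError is never reached
        raise ValueError("Invalid value of 'i'")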
        
    # Common filtering conditions
    common_conditions = (df['window_size'] == 50) & (df['denoise'] == True)
    df = df[common_conditions]
    
    # Simple feature refactoring
    if i in [0, 1, 4, 5]:
        df.reset_index(inplace=True)
        if i == 0:
            X = df[["skew_mean_mean","kurt_mean_mean","edge_mean_mean"]]
        elif i == 1:
            X = df[["skew_mean_mean","kurt_mean_mean","edge_mean_mean","min_depth"]]
        elif i == 4:
            X = df[["skew_mean","kurt_mean","edge_mean"]]
        elif i == 5:
            X = df[["skew_mean","kurt_mean","edge_mean","depth"]]
        y = df["tillage-fertilizer"]
        scan = df['stack_index']
        
    # Metric x Depth Bin Combination features
    elif i in [2, 3]:
        if i == 2:
            X = df[["stack_index","skew_mean_mean","kurt_mean_mean","edge_mean_mean","min_depth","tillage-fertilizer"]]
        elif i == 3:
            X = df[['stack_index','tillage-fertilizer','skew_mean_mean', 'skew_mean_median', 'skew_mean_std', 'skew_median_mean', 'skew_median_median', 'skew_median_std', 'skew_std_mean', 'skew_std_median', 'skew_std_std', 'skew_p5_mean', 'skew_p5_median', 'skew_p5_std', 'skew_p95_mean', 'skew_p95_median', 'skew_p95_std', 'kurt_mean_mean', 'kurt_mean_median', 'kurt_mean_std', 'kurt_median_mean', 'kurt_median_median', 'kurt_median_std', 'kurt_std_mean', 'kurt_std_median', 'kurt_std_std', 'kurt_p5_mean', 'kurt_p5_median', 'kurt_p5_std', 'kurt_p95_mean', 'kurt_p95_median', 'kurt_p95_std', 'vari_mean_mean', 'vari_mean_median', 'vari_mean_std', 'vari_median_mean', 'vari_median_median', 'vari_median_std', 'vari_std_mean', 'vari_std_median', 'vari_std_std', 'vari_p5_mean', 'vari_p5_median', 'vari_p5_std', 'vari_p95_mean', 'vari_p95_median', 'vari_p95_std', 'edge_mean_mean', 'edge_mean_median', 'edge_mean_std', 'edge_median_mean', 'edge_median_median', 'edge_median_std', 'edge_std_mean', 'edge_std_median', 'edge_std_std', 'edge_p5_mean', 'edge_p5_median', 'edge_p5_std', 'edge_p95_mean', 'edge_p95_median', 'edge_p95_std', 'img_mean_mean', 'img_mean_median', 'img_mean_std', 'img_median_mean', 'img_median_median', 'img_median_std', 'img_std_mean', 'img_std_median', 'img_std_std', 'img_p5_mean', 'img_p5_median', 'img_p5_std', 'img_p95_mean', 'img_p95_median', 'img_p95_std', 'img_mean_norm (g/cm3)_mean', 'img_mean_norm (g/cm3)_median', 'img_mean_norm (g/cm3)_std', 'img_median_norm (g/cm3)_mean', 'img_median_norm (g/cm3)_median', 'img_median_norm (g/cm3)_std', 'img_std_norm (g/cm3)_mean', 'img_std_norm (g/cm3)_median', 'img_std_norm (g/cm3)_std', 'img_p5_norm (g/cm3)_mean', 'img_p5_norm (g/cm3)_median', 'img_p5_norm (g/cm3)_std', 'img_p95_norm (g/cm3)_mean', 'img_p95_norm (g/cm3)_median', 'img_p95_norm (g/cm3)_std', 'mean_depth', 'min_depth']]
        X = X.pivot(index='stack_index', columns='min_depth')
        X.columns = [f"{col[0]}_{col[1]}" for col in X.columns]
        X.reset_index(inplace=True)
        scan = X['stack_index']
        X = X.drop('stack_index',axis=1)
        X = X.drop(columns=[col for col in X.columns if col.startswith('tillage-fertilizer') and col != 'tillage-fertilizer_0.0'])
        X = X.rename(columns={'tillage-fertilizer_0.0': 'tillage-fertilizer'})      
        y = X['tillage-fertilizer']
        X = X.drop('tillage-fertilizer',axis=1)

    else:
        raise ValueError("Invalid value of 'i'")
    
    return X, y, scan

In [116]:
## Explore Mutual Information from all depth-binned features
X,y,scan = get_features(3)

# Ignore scan 30 since we are considering all metrics (including density, which has infs for scan 30)
X = X[scan!=30]
X.reset_index(inplace=True)

y = y[scan!=30]

# There are a few ways to split the soil classifications:
# Either treat each tillage-fertilizer combination as its own target treatment (y)
# Or evaluate each treatment type individually (yt and yf):
y_split = y.str.split("-", expand=True)

# Create two separate arrays for tillage and fertilizer
yt = y_split[0] # Tillage treatment
yf = y_split[1] # Fertilizer treatment

# For this exercise we will compute the Mutual Information scores for each feature using 
# each tillage-fertilizer combination as its own target treatment (y)
mi_scores = mutual_info_classif(X, y)

# Create a DataFrame to display the results
feature_importance = pd.DataFrame({'Feature': X.columns, 'Mutual_Information': mi_scores})
feature_importance = feature_importance.sort_values(by='Mutual_Information', ascending=False)
feature_importance.reset_index(inplace=True)
feature_importance = feature_importance.drop('index',axis=1)
feature_importance = feature_importance[1:]
# Plot the feature importance using Plotly Express
fig = px.bar(feature_importance, x='Feature', y='Mutual_Information',
             labels={'Feature': 'Feature', 'Mutual_Information': 'Mutual Information'},
             title='Feature Importance based on Mutual Information',
             height=400)

fig.update_layout(xaxis_tickangle=-45, xaxis=dict(type='category', tickfont=dict(size=8)),
                  xaxis_title_font=dict(size=12), yaxis_title_font=dict(size=12),
                  title_font=dict(size=14))

fig.show()

# Evaluate metrics of interest
depths = ['0.0','3.51','6.824999999999999','10.14']
metrics = ['skew','kurt','edge']
feats = []
totfeats = len(feature_importance)
print("Percentile Ranking of Feature Importance (mutual information) for mean_mean features of skew, kurt and sobel edges")
for metric in metrics:
    for depth in depths:
        feats.append(metric+'_mean_mean_'+depth)
        feature = metric+'_mean_mean_'+depth
        ranking = str((totfeats-feature_importance[feature_importance['Feature'] == feature].index[0])/totfeats*100)
        print(feature + ":" + ranking)


Percentile Ranking of Feature Importance (mutual information) for mean_mean features of skew, kurt and sobel edges
skew_mean_mean_0.0:70.6043956043956
skew_mean_mean_3.51:93.13186813186813
skew_mean_mean_6.824999999999999:94.5054945054945
skew_mean_mean_10.14:73.35164835164835
kurt_mean_mean_0.0:62.08791208791209
kurt_mean_mean_3.51:71.97802197802197
kurt_mean_mean_6.824999999999999:90.38461538461539
kurt_mean_mean_10.14:47.527472527472526
edge_mean_mean_0.0:95.6043956043956
edge_mean_mean_3.51:78.02197802197803
edge_mean_mean_6.824999999999999:95.32967032967034
edge_mean_mean_10.14:61.26373626373627


Most mean_mean features of skew, kurt and edge (9 of 12 in the output above) were above the 70th percentile in the ranking of feature importance based on mutual information. As a reminder, mean_mean indicates that the metric (like skew or kurt) was averaged across the entire soil core image of a horizontal slice, and that average was then averaged across a particular depth bin for a particular x-ray soil sample scan. There are many other options that performed similarly, and some that had particularly high mutual information, such as *skew_p95_median_6.825* (the median, over the horizontal slices within the third depth bin of each scan, of the 95th-percentile skewness; that bin spans roughly 6.8 to 10 cm depth).
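
To make the mean_mean naming concrete, here is a minimal sketch of the second averaging step (slice-level means averaged within a depth bin for each scan). The slice values and the bin edges are assumptions chosen for illustration; the precomputed CSVs were produced in the previous notebook and may have been binned differently.

```python
import pandas as pd

# Toy slice-level table: 'skew_mean' is already the average over one slice image
slices = pd.DataFrame({
    "stack_index": [1, 1, 1, 1, 2, 2, 2, 2],
    "depth": [0.5, 1.5, 4.0, 5.0, 0.5, 1.5, 4.0, 5.0],
    "skew_mean": [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0],
})

# Hypothetical bin edges (cm); each bin is labeled by its lower edge
bin_edges = [0.0, 3.51, 6.825]
slices["min_depth"] = pd.cut(
    slices["depth"], bins=bin_edges, labels=bin_edges[:-1], right=False
).astype(float)

# Average the slice-level means within each (scan, depth bin) -> 'skew_mean_mean'
binned = (
    slices.groupby(["stack_index", "min_depth"], as_index=False)["skew_mean"]
    .mean()
    .rename(columns={"skew_mean": "skew_mean_mean"})
)
print(binned)
```

Swapping `.mean()` for `.median()` or `.std()` in the last step would give the `_median` and `_std` variants of the same slice-level statistic.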

For the purposes of the following analyses, we will continue using the mean_mean metrics of skew, kurt and edge for each of the four depth bins.

## Differentiating treatments by select features
Now that we've taken a preliminary look at how informative individual features may be for classifying soil treatments, we evaluate how effective select features are for classification. We will use two techniques: dimensionality reduction (t-SNE) and modelling (SVM).

### Dimensionality reduction
Here we look at how our 12-feature data set (4 depth bins x 3 metrics) can be reduced to two dimensions and how the different treatments cluster. Principal Component Analysis (PCA) is great for data that has mostly linear relationships. However, our data is rather complex; even for a single depth bin, the points lie on a curved 3D manifold rather than in a cluster that resembles a line or plane. I instead used **t-distributed stochastic neighbor embedding (t-SNE)**, a non-linear dimensionality reduction technique, to map the data from 12 features to 2 components. The goal was to understand how the soil types separate and to characterize those differences.

In [117]:
# Collect features
X,y,scan = get_features(2)
y_split = y.str.split("-", expand=True)
yt = y_split[0]
yf = y_split[1]

# Perform TSNE transformation - better results are indicated by a lower Kullback-Leibler (KL) divergence
tsne = TSNE(n_components=2, perplexity = 50, learning_rate = 'auto', n_iter = 10000, random_state=42)
X_tsne = tsne.fit_transform(X)
print(f"Kullback-Leibler (KL) divergence: {tsne.kl_divergence_}")

# Plot t-SNE components, colored by tillage and with marker symbols by fertilizer
custom_symbols = ['circle', 'square', 'cross']
fig = px.scatter(x=X_tsne[:, 0], y=X_tsne[:, 1], color=yt, size=np.full(len(X_tsne), 1),size_max=15,symbol=yf,symbol_sequence=custom_symbols)
fig.update_layout(
    title="t-SNE visualization of x-ray stats by tillage and fertilizer",
    xaxis_title="First t-SNE",
    yaxis_title="Second t-SNE",
    xaxis=dict(showgrid=True, gridcolor='lightgray'),  # Customize x-axis grid lines
    yaxis=dict(showgrid=True, gridcolor='lightgray'),  # Customize y-axis grid lines
    plot_bgcolor='white'  # Set the background color to white 
)
fig.update_traces(marker=dict(opacity=1, line=dict(width=1)))
fig.show()

# Influence of fertilizer and tillage on both tsne variables
df = pd.DataFrame({'x': X_tsne[:, 0],
                   'y': X_tsne[:, 1], 
                   'fertilizer': yf,
                   'tillage': yt})
modelx = ols('x ~ C(fertilizer) + C(tillage) + C(fertilizer):C(tillage)', data=df).fit()
modely = ols('y ~ C(fertilizer) + C(tillage) + C(fertilizer):C(tillage)', data=df).fit()

# Perform ANOVA test
anovax = anova_lm(modelx)
anovay = anova_lm(modely)
print('ANOVA ON FIRST TSNE')
print(anovax)
print('ANOVA ON SECOND TSNE')
print(anovay)


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.



Kullback-Leibler (KL) divergence: 0.0020751922857016325


ANOVA ON FIRST TSNE
                            df    sum_sq   mean_sq          F        PR(>F)
C(fertilizer)              2.0  2.807664  1.403832  54.116403  3.930940e-13
C(tillage)                 2.0  0.033339  0.016670   0.642598  5.302983e-01
C(fertilizer):C(tillage)   4.0  0.074524  0.018631   0.718211  5.835906e-01
Residual                  49.0  1.271108  0.025941        NaN           NaN
ANOVA ON SECOND TSNE
                            df    sum_sq   mean_sq          F        PR(>F)
C(fertilizer)              2.0  4.679725  2.339863  25.070679  3.173656e-08
C(tillage)                 2.0  2.769315  1.384658  14.836045  9.166363e-06
C(fertilizer):C(tillage)   4.0  0.052038  0.013009   0.139391  9.668077e-01
Residual                  49.0  4.573202  0.093331        NaN           NaN


These results differ slightly from those presented in the accompanying slides due to the inclusion here of scan 30, but the conclusions are the same: **t-SNE components illustrate strongly separated clusters by soil treatment.**

ANOVA: 
- Significant differences in t-SNE_1 by fertilizer (P ~ 4E-13)
- Significant differences in t-SNE_2 by both fertilizer (P ~ 3E-8) and tillage (P ~ 9E-6)
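
As a check on how to read the ANOVA tables, the F statistic in each row is that row's mean square divided by the residual mean square, and the P value is the upper tail of the F distribution with the matching degrees of freedom. The sketch below reproduces the fertilizer row of the first t-SNE table from the printed values; it uses `scipy.stats.f`, which is not imported elsewhere in this notebook.

```python
from scipy.stats import f

# Values copied from the 'ANOVA ON FIRST TSNE' table above
mean_sq_fertilizer = 1.403832
mean_sq_residual = 0.025941
df_fertilizer, df_residual = 2, 49

# F = between-group mean square / residual mean square
f_stat = mean_sq_fertilizer / mean_sq_residual

# P value = probability of an F at least this large under the null hypothesis
p_value = f.sf(f_stat, df_fertilizer, df_residual)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```

Small differences from the table in the last digits are expected because the printed mean squares are rounded.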


While t-SNE is excellent for visualizing clusters, a downside is that there is no direct or deterministic mapping between the original features and the t-SNE components, which makes it tricky to draw too many conclusions beyond the cluster visualizations themselves. But we can still observe strong correlations for different components:

- t-SNE_1 - strongly correlated with skew and kurtosis (which seem negatively correlated with each other).
- t-SNE_2 - strongly (negatively) correlated with sobel edge features.

See the correlation coefficients below.

In [118]:
Xall = X.copy()
Xall['tsne_1'] = X_tsne[:, 0]
Xall['tsne_2'] = X_tsne[:, 1]
correlation_matrix = Xall.corr()

# Sort the correlations in descending order to find the most influential variables for tsne_1
sorted_correlation_tsne_1 = correlation_matrix['tsne_1'].sort_values(ascending=False)

# Sort the correlations in descending order to find the most influential variables for tsne_2
sorted_correlation_tsne_2 = correlation_matrix['tsne_2'].sort_values(ascending=False)

# Print the sorted correlations
print("Most influential variables for tsne_1:")
print(sorted_correlation_tsne_1)

print("\nMost influential variables for tsne_2:")
print(sorted_correlation_tsne_2)

# Plot sample features 
fig = px.scatter(x=X['edge_mean_mean_3.51'], y=X['kurt_mean_mean_6.824999999999999'], color=yt, size=np.full(len(X_tsne), 1),size_max=15,symbol=yf,symbol_sequence=custom_symbols)
fig.update_layout(
    title="Sample features",
    xaxis_title="mean-mean sobel edges 3.51-6.83 cm depth",
    yaxis_title="mean-mean kurtosis 6.83-10.14 cm depth",
    xaxis=dict(showgrid=True, gridcolor='lightgray'),  # Customize x-axis grid lines
    yaxis=dict(showgrid=True, gridcolor='lightgray'),  # Customize y-axis grid lines
    plot_bgcolor='white'  # Set the background color to white 
)
fig.update_traces(marker=dict(opacity=1, line=dict(width=1)))
fig.show()

Most influential variables for tsne_1:
tsne_1                              1.000000
skew_mean_mean_3.51                 0.860489
skew_mean_mean_0.0                  0.820958
skew_mean_mean_6.824999999999999    0.802489
skew_mean_mean_10.14                0.640786
edge_mean_mean_3.51                 0.205635
edge_mean_mean_6.824999999999999    0.133382
edge_mean_mean_10.14               -0.020322
tsne_2                             -0.024127
edge_mean_mean_0.0                 -0.564315
kurt_mean_mean_0.0                 -0.576362
kurt_mean_mean_3.51                -0.717262
kurt_mean_mean_10.14               -0.799254
kurt_mean_mean_6.824999999999999   -0.805760
Name: tsne_1, dtype: float64

Most influential variables for tsne_2:
tsne_2                              1.000000
kurt_mean_mean_3.51                 0.566635
kurt_mean_mean_6.824999999999999    0.457974
kurt_mean_mean_0.0                  0.422908
skew_mean_mean_10.14                0.327605
skew_mean_mean_6.824999999999999    0

**PCA:** While t-SNE was chosen over PCA given the slight non-linearity of the data, PCA can produce similar clustering on decomposition:

In [52]:
pca = PCA(n_components=2)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
X_pca = pca.fit_transform(X_scaled)
fig = px.scatter(x=X_pca[:, 0], y=X_pca[:, 1], color=yt, size=np.full(len(X_pca), 1),size_max=15,symbol=yf,symbol_sequence=custom_symbols)

fig.update_layout(
    title="PCA visualization of x-ray stats by tillage and fertilizer",
    xaxis_title="First pca",
    yaxis_title="Second pca",
    xaxis=dict(showgrid=True, gridcolor='lightgray'),  # Customize x-axis grid lines
    yaxis=dict(showgrid=True, gridcolor='lightgray'),  # Customize y-axis grid lines
    plot_bgcolor='white'  # Set the background color to white 
)
fig.update_traces(marker=dict(opacity=1, line=dict(width=1)))
fig.show()
df = pd.DataFrame({'x': X_pca[:, 0],
                   'y': X_pca[:, 1], 
                   'fertilizer': yf,
                   'tillage': yt})
modelx = ols('x ~ C(fertilizer) + C(tillage) + C(fertilizer):C(tillage)', data=df).fit()
modely = ols('y ~ C(fertilizer) + C(tillage) + C(fertilizer):C(tillage)', data=df).fit()

# Perform ANOVA test
anovax = anova_lm(modelx)
anovay = anova_lm(modely)
print('ANOVA ON FIRST PCA')
print(anovax)
print('ANOVA ON SECOND PCA')
print(anovay)
component_mapping = pca.components_

# Create a DataFrame to display the table
component_mapping_table = pd.DataFrame(component_mapping, columns=X.columns, index=['PCA 1', 'PCA 2'])

print("Component Mapping Table:")
print(component_mapping_table)

ANOVA ON FIRST PCA
                            df     sum_sq    mean_sq          F        PR(>F)
C(fertilizer)              2.0  98.162052  49.081026  32.930505  8.621336e-10
C(tillage)                 2.0  10.764630   5.382315   3.611219  3.443633e-02
C(fertilizer):C(tillage)   4.0   0.344747   0.086187   0.057826  9.935925e-01
Residual                  49.0  73.031686   1.490443        NaN           NaN
ANOVA ON SECOND PCA
                            df     sum_sq    mean_sq          F        PR(>F)
C(fertilizer)              2.0  80.454268  40.227134  53.576474  4.653979e-13
C(tillage)                 2.0  13.168624   6.584312   8.769310  5.552595e-04
C(fertilizer):C(tillage)   4.0   5.510975   1.377744   1.834947  1.371611e-01
Residual                  49.0  36.790954   0.750836        NaN           NaN
Component Mapping Table:
       skew_mean_mean_0.0  skew_mean_mean_3.51  \
PCA 1            0.236149             0.343760   
PCA 2            0.165926             0.224527   

     

### Classification with SVM model
Here we evaluate the ability of our selected metrics to be good predictors for classifying soil treatment using a support vector machine (SVM). An SVM model is a useful classification tool for this dataset because it can effectively handle high-dimensional feature spaces, making it suitable for scenarios where there are numerous features or metrics. Additionally, SVMs are known for their versatility in handling both linear and non-linear classification tasks, allowing us to capture complex relationships within the data. 
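
To illustrate the point about linear versus non-linear classification, the sketch below compares a linear and an RBF kernel on a synthetic data set of two concentric rings, which no straight line can separate. The data set and the pipeline layout are assumptions for this example and are unrelated to the soil features; the notebook itself uses a linear kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Two concentric rings: a deliberately non-linear toy problem
X_demo, y_demo = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

mean_accuracy = {}
for kernel in ("linear", "rbf"):
    # Scaling lives inside the pipeline so it is refit on each training fold
    model = make_pipeline(RobustScaler(), SVC(kernel=kernel, C=1))
    mean_accuracy[kernel] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

print(mean_accuracy)
```

On data like this the RBF kernel should clearly outperform the linear one; on the soil metrics, the same comparison would show whether a non-linear kernel adds anything over the linear SVM used below.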

In [64]:
def evaluate_svn(train, cols, target, ids):
    X = train[cols]
    y = target
    
    folds = 5
    kfold = StratifiedGroupKFold(n_splits=folds, shuffle=True, random_state=1)
    scores = np.full((folds), np.nan)
    
    i = 0
    execution_time = 0
    
    for train_ix, valid_ix in kfold.split(X, y, ids):
        start_time = time.perf_counter()
        X_train, X_valid = X.iloc[train_ix], X.iloc[valid_ix]
        
        # Scale each feature with RobustScaler, fit on the training fold only
        scaler = RobustScaler()
        X_train = scaler.fit_transform(X_train)
        X_valid = scaler.transform(X_valid)
        
        # Select labels by position, matching the positional indices from the splitter
        y_train, y_valid = y.iloc[train_ix], y.iloc[valid_ix]
        # Create the linear SVM classifier (SVC handles multi-class problems with a one-vs-one scheme internally)
        svm_classifier = SVC(kernel='linear', C=1)

        # Train the classifier on the training data
        svm_classifier.fit(X_train, y_train)

        # Make predictions on the validation data
        y_pred = svm_classifier.predict(X_valid)
        
        # Calculate the accuracy of the classifier
        score = accuracy_score(y_valid, y_pred)
        #print("Fold " + str(i) + ": {:.2f}".format(score))
        scores[i] = score
        
        end_time = time.perf_counter()
        execution_time += (end_time - start_time)
        i = i+1
    
    
    execution_time /= folds
    scoresmean = np.nanmean(scores, axis=0)
    return scores, scoresmean, execution_time

In [76]:
# Collect features
X,y,scan = get_features(2)
y_split = y.str.split("-", expand=True)
yt = y_split[0]
yf = y_split[1]

scorestf, scoresmeantf, execution_time = evaluate_svn(X,X.columns,y,scan)
scorest, scoresmeant, execution_time = evaluate_svn(X,X.columns,yt,scan)
scoresf, scoresmeanf, execution_time = evaluate_svn(X,X.columns,yf,scan)

scores = {'Fold': ['1', '2', '3', '4', '5', 'Average'],
        'Tillage-Fertilizer Classification': np.append(scorestf, scoresmeantf),
        'Tillage Classification': np.append(scorest, scoresmeant),
        'Fertilizer Classification': np.append(scoresf, scoresmeanf)}
df_scores = pd.DataFrame(scores)
df_scores

| | Fold | Tillage-Fertilizer Classification | Tillage Classification | Fertilizer Classification |
|---|---|---|---|---|
| 0 | 1 | 0.818182 | 0.909091 | 0.818182 |
| 1 | 2 | 0.727273 | 0.909091 | 0.909091 |
| 2 | 3 | 1.0 | 0.9 | 0.818182 |
| 3 | 4 | 0.727273 | 0.818182 | 0.909091 |
| 4 | 5 | 0.909091 | 0.727273 | 0.9 |
| 5 | Average | 0.836364 | 0.852727 | 0.870909 |


To determine the accuracy of the SVM model, I used cross-validation with 5 folds (for each fold, the model was trained on 80% of the scans to predict the soil type of the other 20% of the scans) and compared the accuracy of the predictions.
I used stratified group k-folds to preserve the distribution of each soil type in the folds while keeping each scan within a single fold. **Accuracy = total correct predictions / total predictions**. While the results differ slightly from what is presented in the accompanying slides, the conclusion is the same: **Skew, Kurtosis, and Sobel edge seem to be good metrics to differentiate soils.** Slight improvements can be observed in fertilizer classification accuracy by using some of the features with the highest mutual information, although accuracy is reduced when classifying by tillage or tillage-fertilizer.

In [134]:
a = np.array(feature_importance.loc[[1,2,3,5,6,7,9,10,11,12,13,14],'Feature'])

# Collect features
X,y,scan = get_features(3)

y_split = y.str.split("-", expand=True)
yt = y_split[0]
yf = y_split[1]

scorestf, scoresmeantf, execution_time = evaluate_svn(X,a,y,scan)
scorest, scoresmeant, execution_time = evaluate_svn(X,a,yt,scan)
scoresf, scoresmeanf, execution_time = evaluate_svn(X,a,yf,scan)

scores = {'Fold': ['1', '2', '3', '4', '5', 'Average'],
        'Tillage-Fertilizer Classification': np.append(scorestf, scoresmeantf),
        'Tillage Classification': np.append(scorest, scoresmeant),
        'Fertilizer Classification': np.append(scoresf, scoresmeanf)}
df_scores = pd.DataFrame(scores)
print(a)
df_scores

['skew_p95_median_6.824999999999999' 'kurt_p5_median_6.824999999999999'
 'edge_median_mean_0.0' 'skew_p95_median_3.51' 'skew_p95_mean_3.51'
 'skew_p95_mean_6.824999999999999' 'kurt_p95_median_6.824999999999999'
 'kurt_p95_mean_6.824999999999999' 'edge_p95_mean_0.0'
 'edge_p95_median_3.51' 'edge_std_mean_0.0' 'edge_median_median_0.0']


| | Fold | Tillage-Fertilizer Classification | Tillage Classification | Fertilizer Classification |
|---|---|---|---|---|
| 0 | 1 | 0.909091 | 0.818182 | 1.0 |
| 1 | 2 | 0.818182 | 1.0 | 1.0 |
| 2 | 3 | 0.8 | 1.0 | 0.909091 |
| 3 | 4 | 0.727273 | 0.636364 | 1.0 |
| 4 | 5 | 0.727273 | 0.636364 | 0.9 |
| 5 | Average | 0.796364 | 0.818182 | 0.961818 |


There are a variety of combinations of metrics that can result in similarly predictive classifications. However, as was observed during the dimensionality reduction analysis, it's likely that most of the useful information is shared across the various metrics, and that only a few select metrics at certain depths are necessary for good, representative classification predictions.

In the below cell, we observe the correlations between the top metrics as determined by mutual information with tillage-fertilizer (analyzed earlier in this notebook).

In [135]:
X,y,scan = get_features(3)
X = X[feature_importance.loc[0:15,'Feature']]
correlation_matrix = abs(X.corr())
correlation_matrix

Unnamed: 0,skew_p95_median_6.824999999999999,kurt_p5_median_6.824999999999999,edge_median_mean_0.0,img_std_norm (g/cm3)_mean_6.824999999999999,skew_p95_median_3.51,skew_p95_mean_3.51,skew_p95_mean_6.824999999999999,img_std_norm (g/cm3)_median_6.824999999999999,kurt_p95_median_6.824999999999999,kurt_p95_mean_6.824999999999999,edge_p95_mean_0.0,edge_p95_median_3.51,edge_std_mean_0.0,edge_median_median_0.0,kurt_p5_mean_6.824999999999999
skew_p95_median_6.824999999999999,1.0,0.510125,0.724389,0.249432,0.834886,0.832558,0.996273,0.0906,0.153175,0.162858,0.652158,0.416088,0.647981,0.690748,0.547765
kurt_p5_median_6.824999999999999,0.510125,1.0,0.328346,0.876615,0.566943,0.562021,0.493258,0.155758,0.726412,0.734718,0.307443,0.392338,0.302081,0.221943,0.982133
edge_median_mean_0.0,0.724389,0.328346,1.0,0.151699,0.710809,0.713476,0.737978,0.170983,0.118849,0.097563,0.922119,0.499554,0.905294,0.95643,0.355926
img_std_norm (g/cm3)_mean_6.824999999999999,0.249432,0.876615,0.151699,1.0,0.326508,0.315996,0.237705,0.979538,0.731032,0.73567,0.147621,0.587511,0.151629,0.049764,0.858321
skew_p95_median_3.51,0.834886,0.566943,0.710809,0.326508,1.0,0.994575,0.836975,0.154557,0.280613,0.284436,0.634629,0.186829,0.626066,0.616216,0.589455
skew_p95_mean_3.51,0.832558,0.562021,0.713476,0.315996,0.994575,1.0,0.835003,0.167099,0.290558,0.295455,0.640683,0.185812,0.632299,0.623911,0.584025
skew_p95_mean_6.824999999999999,0.996273,0.493258,0.737978,0.237705,0.836975,0.835003,1.0,0.084273,0.129722,0.140232,0.665152,0.43703,0.66189,0.706139,0.535315
img_std_norm (g/cm3)_median_6.824999999999999,0.0906,0.155758,0.170983,0.979538,0.154557,0.167099,0.084273,1.0,0.200215,0.176654,0.214099,0.204473,0.238647,0.152955,0.137552
kurt_p95_median_6.824999999999999,0.153175,0.726412,0.118849,0.731032,0.280613,0.290558,0.129722,0.200215,1.0,0.986529,0.056559,0.599574,0.043934,0.180862,0.71513
kurt_p95_mean_6.824999999999999,0.162858,0.734718,0.097563,0.73567,0.284436,0.295455,0.140232,0.176654,0.986529,1.0,0.042562,0.592722,0.026979,0.169812,0.730918


From this selection of features, if we choose just three features with relatively low correlation with each other (they share minimal information) and use them rather than a set of 12 features, we still get reasonable prediction accuracy (especially for fertilizer, although overall not as accurate as with the previously selected sets of 12 features).
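
The three features in the next cell were chosen by inspecting the correlation matrix. One way this kind of selection could be automated is a greedy pass over the ranked features, sketched below on synthetic data. The helper `pick_uncorrelated` and its threshold are hypothetical and are not part of this analysis package.

```python
import numpy as np
import pandas as pd

def pick_uncorrelated(frame, ranked_features, max_abs_corr=0.8):
    """Hypothetical helper: walk features in ranked order and keep each one
    only if its |correlation| with every feature kept so far is below the limit."""
    corr = frame[ranked_features].corr().abs()
    chosen = []
    for feature in ranked_features:
        if all(corr.loc[feature, kept] < max_abs_corr for kept in chosen):
            chosen.append(feature)
    return chosen

# Synthetic example: two independent signals, each with a near-duplicate
rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.05 * rng.normal(size=200),
    "c": c,
    "c_copy": c + 0.05 * rng.normal(size=200),
})

selected = pick_uncorrelated(toy, ["a", "a_copy", "c", "c_copy"])
print(selected)
```

Passing the features in order of decreasing mutual information would keep the most informative member of each correlated group.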

In [136]:
# Collect features
X,y,scan = get_features(3)
y_split = y.str.split("-", expand=True)
yt = y_split[0]
yf = y_split[1]

a = ['skew_p95_median_6.824999999999999', 'kurt_p95_median_6.824999999999999', 'edge_p95_median_3.51']

scorestf, scoresmeantf, execution_time = evaluate_svn(X,a,y,scan)
scorest, scoresmeant, execution_time = evaluate_svn(X,a,yt,scan)
scoresf, scoresmeanf, execution_time = evaluate_svn(X,a,yf,scan)

scores = {'Fold': ['1', '2', '3', '4', '5', 'Average'],
        'Tillage-Fertilizer Classification': np.append(scorestf, scoresmeantf),
        'Tillage Classification': np.append(scorest, scoresmeant),
        'Fertilizer Classification': np.append(scoresf, scoresmeanf)}
df_scores = pd.DataFrame(scores)
df_scores

| | Fold | Tillage-Fertilizer Classification | Tillage Classification | Fertilizer Classification |
|---|---|---|---|---|
| 0 | 1 | 0.909091 | 0.909091 | 1.0 |
| 1 | 2 | 0.636364 | 0.727273 | 1.0 |
| 2 | 3 | 0.8 | 0.9 | 0.909091 |
| 3 | 4 | 0.545455 | 0.636364 | 1.0 |
| 4 | 5 | 0.727273 | 0.636364 | 0.8 |
| 5 | Average | 0.723636 | 0.761818 | 0.941818 |


## Conclusions:
Skew, Kurtosis, and Sobel edge seem to be good metrics to differentiate soils. Changes in skew and kurtosis seem indicative of fertilizer vs manure, while changes in edges seem indicative of tillage differences. Depth affects which metrics are most important, especially for differentiating tillage in high fertilizer soils.

Future directions:
- How do these metrics connect with physical metrics like drainage, nutrient transport, etc? More experiments required.
- How do these metrics connect to pore structure metrics?
- How do bulk density measurements compare to x-ray calculated values?
