First, load the data. This notebook uses the PMI data from the paper "Environmental predictors impact microbial-based postmortem interval (PMI) estimation models within human decomposition soils". The preprocessed data consists of OTU-, phylum-, class-, and order-level abundance matrices, each available with or without environmental factors.

This analysis is mainly for the final project of EPP622. It differs from the previous file in two ways:

1. The data preprocessing is different. Previously we only considered OTUs/ASVs that make up $\ge 1\%$ of the total microbiome community as "present"; here we lower the threshold to $0.1\%$, following the paper.

2. To keep things simple, we use only the 16S data and do not consider environmental data.
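
The presence rule in item 1 can be sketched as follows, assuming samples are rows and taxa are columns of a raw count matrix. The function name `presence_mask` and the toy counts are illustrative only and are not part of the `FS` module.

```python
import numpy as np

def presence_mask(counts, threshold=0.001):
    """Return a boolean matrix that is True where a taxon's relative
    abundance within a sample is at least `threshold` (0.001 = 0.1%)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Hypothetical safeguard: avoid dividing by zero for empty samples
    totals[totals == 0] = 1.0
    rel = counts / totals
    return rel >= threshold

# Two toy samples with three taxa each (10,000 reads per sample)
toy_counts = np.array([[9900,   95,  5],
                       [5000, 4980, 20]])

mask_1pct = presence_mask(toy_counts, threshold=0.01)     # previous 1% rule
mask_01pct = presence_mask(toy_counts, threshold=0.001)   # 0.1% rule used here
print(mask_1pct)
print(mask_01pct)
```

Lowering the threshold can only turn more entries to "present", which is why more features may pass the later selection step.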




In [1]:
import sys
sys.path.append('../Code')
import loadData 
import RunML
import RunML_continue
import FS
import metric

import pandas as pd
import numpy as np
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
import pickle
import matplotlib.pyplot as plt

In [2]:
import glob
import os

In [3]:
PMIdata_path = '../Data/PMI/'

## No env model
16s (OTU/phylum/class/order) - no env

### Data preprocess


In [4]:
bact_noenv_files = glob.glob(PMIdata_path + 'bact.n.*.noenv.csv')


In [5]:
bact_noenv_files

['../Data/PMI/bact.n.otu.noenv.csv',
 '../Data/PMI/bact.n.order.noenv.csv',
 '../Data/PMI/bact.n.class.noenv.csv',
 '../Data/PMI/bact.n.phylum.noenv.csv']
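
Note that `glob.glob` returns paths in an arbitrary, filesystem-dependent order, so positional indices such as `[3]` mean "phylum" only for the listing shown above. A small sketch of deriving the rank from each file name instead, assuming the names always follow the `bact.n.<rank>.noenv.csv` pattern; `rank_from_path` is a hypothetical helper, not part of `loadData`:

```python
import os

def rank_from_path(path):
    """Extract the taxonomic rank from a name like 'bact.n.otu.noenv.csv'."""
    parts = os.path.basename(path).split('.')
    return parts[2]  # third dot-separated field is the rank

example_files = ['../Data/PMI/bact.n.otu.noenv.csv',
                 '../Data/PMI/bact.n.order.noenv.csv',
                 '../Data/PMI/bact.n.class.noenv.csv',
                 '../Data/PMI/bact.n.phylum.noenv.csv']

ranks = [rank_from_path(p) for p in example_files]
print(ranks)  # ['otu', 'order', 'class', 'phylum']
```

Keeping such a list next to the dataframes makes it possible to look up a rank by name rather than by position.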

In [6]:
# Read each CSV file into a list of dataframes
bact_noenv_df_list = [pd.read_csv(file) for file in bact_noenv_files]


In [None]:
for df in bact_noenv_df_list:
    print(df.shape)

In [None]:
bact_noenv_df_list[3]

In [None]:
data_4taxa = []
col_names_4taxa = []
for df in bact_noenv_df_list:
    data = df.drop(df.columns[0], axis=1)  # drop the first column (the target variable)
    cols_name = data.columns.tolist()      # keep the feature names for later labeling
    data = data.values
    data = FS.relative_abundance(data)
    data_4taxa.append(data)
    col_names_4taxa.append(cols_name)

In [None]:
# target variable
y = bact_noenv_df_list[3].iloc[:, 0].values 
y

In [None]:
# Define the threshold
y_threshold = 2500

# Categorize the series based on the threshold
y = np.where(y > y_threshold, 'LONG', 'SHORT')

print(y)

In [None]:
list(y).count('LONG')

In [None]:
list(y).count('SHORT')

##### 1. Calculate H statistics for OTU/phylum/class/order (16S only)

In [None]:
weights_4taxa = []

In [None]:
for df in data_4taxa:
    print(np.shape(df))
    weights=FS.OTU_H_Score_fun(df,y,cutOff=0.001)
    weights_4taxa.append(weights)
    

In [None]:
for weight in weights_4taxa:
    print(len(weight))

In [None]:
max(weights_4taxa[3])

In [None]:
selectedOTU_index_4tax = []
eps_4tax = []

In [None]:
for weight in weights_4taxa:
    selectedOTU_index, eps=FS.indice_H_unisig(weight,y)
    print(eps)
    selectedOTU_index_4tax.append(selectedOTU_index)
    eps_4tax.append(eps)
    

Here, the number of selected features increases at every taxonomic level because we lowered the presence threshold.

##### 2. Select feature indices based on H statistics and form the subset of selected features.
The function's default significance level is 10%, and the returned indices are ranked by H statistic in descending order.

Use "indice_H_unisig" if there is only one response; use "indice_H_multisig" for multiple responses.

The cells below use `weights_4taxa`, `selectedOTU_index_4tax`, `col_names_4taxa`, and `eps_4tax`.
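
To make the selection rule concrete, here is a minimal sketch under the assumption that the H statistic is the Kruskal-Wallis H computed on presence/absence and that the cutoff is the chi-square critical value at the 10% level. The internals of `FS.OTU_H_Score_fun` and `FS.indice_H_unisig` are not shown in this notebook, so `h_score_sketch` is a hypothetical stand-in and may differ from them.

```python
import numpy as np
from scipy.stats import kruskal, chi2

def h_score_sketch(abundance, labels, cutoff=0.001):
    """Kruskal-Wallis H of one feature's presence/absence across classes."""
    present = (np.asarray(abundance) > cutoff).astype(int)
    labels = np.asarray(labels)
    groups = [present[labels == c] for c in np.unique(labels)]
    h, _ = kruskal(*groups)  # scipy applies the tie correction
    return h

# Toy feature: mostly present in LONG samples, absent in SHORT samples
abundance = np.array([0.02, 0.03, 0.0, 0.04, 0.0, 0.0, 0.0005, 0.0])
labels = np.array(['LONG'] * 4 + ['SHORT'] * 4)

h = h_score_sketch(abundance, labels)
# Critical value for significance level 0.10 with (number of classes - 1) degrees of freedom
critical = chi2.ppf(1 - 0.10, df=len(np.unique(labels)) - 1)
print(h, critical, h > critical)
```

A feature is kept in this sketch when its H exceeds the critical value; ranking the kept features by H in descending order mirrors the ordering described above.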

In [None]:
weights_sig_sorted_4taxa = []
col_names_sig_sorted_4taxa = []
for i in range(len(weights_4taxa)):
    weights_sig_sorted = weights_4taxa[i][selectedOTU_index_4tax[i]]
    col_names_sig_sorted = [col_names_4taxa[i][j] for j in selectedOTU_index_4tax[i]]
    weights_sig_sorted_4taxa.append(weights_sig_sorted)
    col_names_sig_sorted_4taxa.append(col_names_sig_sorted)

In [None]:
# Labels follow the file order listed above: otu, order, class, phylum
taxlabels = ['OTU', 'order', 'class', 'phylum']

# weights_sig_sorted_4taxa holds numeric arrays of H statistics;
# col_names_sig_sorted_4taxa holds the matching feature names for each point

plt.figure(figsize=(10, 10))
for i, array in enumerate(weights_sig_sorted_4taxa):
    x_values = [taxlabels[i]] * len(array)  # Label each point with its group (e.g., 'OTU', 'class', etc.)
    plt.scatter(x_values, array, label=f'{taxlabels[i]}')
    
    # Annotate each point with its name from col_names_sig_sorted_4taxa[i][j] and its value
    for j, z in enumerate(array):
        label = col_names_sig_sorted_4taxa[i][j]  # Get the corresponding label for this point
        plt.text(taxlabels[i], z, label, ha='center', va='bottom', fontsize=8, color='black')

plt.title('Dot Plot of H statistics')
plt.xlabel('Taxonomic Rank')
plt.ylabel('H statistics')
plt.show()

In [None]:
#plot the h statistics and cutoff descendingly
#for i in range(len(weights_4taxa)):
    #FS.plotWeightedIndex(weights_4taxa[i],threshold=eps_4tax[i])

In [None]:
data_4taxa[0]

#### 4. Model
Prepare four datasets: the full dataset, our selected dataset, the Lasso-selected dataset (based on the target variable), and a randomly selected dataset (with the same number of variables as in our method).

Use random forest and SVM as classifiers, and build both models for each response variable.

For Lasso, the selected subset depends on the response variable, so it differs between models for different response variables.

For random selection, the process is repeated `iter` times (set to 100 in the next cell) to find the mean accuracy and AUC.

SMOTE is used because the data are not balanced; without SMOTE the performance is very poor, especially for the SVM model.
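
The random-selection baseline can be sketched as below. How `RunML_continue` draws the random subsets is not shown in this notebook, so this is only one plausible way to do it; `random_feature_subsets` and the toy matrix are hypothetical.

```python
import numpy as np

def random_feature_subsets(X, n_selected, n_repeats, seed=0):
    """Yield `n_repeats` random column subsets of X, each with `n_selected` columns."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    for _ in range(n_repeats):
        # Sample column indices without replacement
        cols = rng.choice(n_features, size=n_selected, replace=False)
        yield cols, X[:, cols]

# Toy matrix: 4 samples, 10 features
X_toy = np.arange(40, dtype=float).reshape(4, 10)
subsets = list(random_feature_subsets(X_toy, n_selected=3, n_repeats=5))
print([cols.tolist() for cols, _ in subsets])
```

Averaging a classifier's accuracy and AUC over these subsets gives a reference point for judging whether a feature selection method does better than chance.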

In [None]:
iter = 100  # number of repetitions (note: this name shadows Python's built-in iter)
cls = ["RF","SVM"]

In [None]:
targetLabel=y

In [None]:
data_subset_4taxa = []
X_lasso_4taxa = []
xind_lasso_4taxa = []
for i, data in enumerate(data_4taxa):
    X_lasso, xind_lasso = RunML_continue.LassoFeatureSelection(data, targetLabel)
    X_lasso_4taxa.append(X_lasso)  # store the Lasso-selected matrix for this rank
    xind_lasso_4taxa.append(xind_lasso)
    data_subset = {"AllFeatures": data,
                   "SelectMicro": data[:, selectedOTU_index_4tax[i]],
                   "Lasso": X_lasso,
                   "Random": data,  # full matrix; random columns are presumably drawn from it downstream
                   }
    data_subset_4taxa.append(data_subset)

In [None]:
for dataset  in data_subset_4taxa:
    data_subset = dataset
    for datatype, subset in data_subset.items():
        print(np.shape(subset))

In [None]:
with open('../Data/PMI/subset_bact_4taxa_noenv.pkl', 'wb') as file:
    pickle.dump(data_subset_4taxa, file)


The function prints the accuracy and AUC for each dataset and classifier, and returns y_actual, y_predict, and y_predprob for future use.

In [None]:
#dict_cm = RunML_continue.runClassifier_FScompare(data_subsets= data_subset,y= targetLabel,N=iter,classifiers=cls)

In [None]:
xind_lasso_4taxa

In [None]:


def plotPresenseRatio(X,label,featurenames,posLabel,posText="",negText="",thresholdPercent=0.90,abundanceCutoff=0.01,entries=15):
    import matplotlib as mpl
    mpl.rcParams['figure.dpi'] = 300

    presenceCntPos = []
    presenceCntNeg = []
    
    X_relative = FS.relative_abundance(X)
    
    X_relative = X_relative.T
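    if abundanceCutoff==0:
        from itertools import chain
        # derive the cutoff from the thresholdPercent quantile of all relative abundances
        flatten_list_sorted = sorted(chain.from_iterable(X_relative))
        cutoff_pos = min(int(len(flatten_list_sorted)*float(thresholdPercent)), len(flatten_list_sorted)-1)
        abundanceCutoff = flatten_list_sorted[cutoff_pos]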

    if posText=="" or negText=="":
        posText=posLabel
        negText="Not "+posLabel

    for k in range(len(X_relative)):## for each OTU
        OTUs = X_relative[k]## the samples for this OTU
        pos = 0
        neg = 0
        for i in range(len(OTUs)):
            if label[i] == posLabel:
                if OTUs[i] > abundanceCutoff:# if the value of OTU exceed the abundanceCutoff
                    pos += 1
            else:
                if OTUs[i] > abundanceCutoff:
                    neg += 1
        presenceCntPos.append(pos)# one entry per OTU: number of positive-class samples where the OTU exceeds abundanceCutoff
        presenceCntNeg.append(neg)# same count for the remaining (negative-class) samples
        
    all_pos_label_cnt=list(label).count(posLabel)
    all_neg_label_cnt=len(label)-all_pos_label_cnt
    print(all_pos_label_cnt,all_neg_label_cnt)# class sizes; pandas value_counts would give the same numbers
    
    presenceRatioPos=[float(x)/all_pos_label_cnt for x in presenceCntPos]# per OTU: fraction of positive-class samples in which the OTU is present
    presenceRatioNeg=[float(x)/all_neg_label_cnt for x in presenceCntNeg]# per OTU: fraction of negative-class samples in which the OTU is present

    import matplotlib.pyplot as plt
    y = range(entries)
    fig, axes = plt.subplots(ncols=2, sharey=True)
    bars_pos = axes[0].barh(y, presenceRatioPos, align='center', color='#ff7f00')
    bars_neg =axes[1].barh(y, presenceRatioNeg, align='center', color='#377eb8')
    axes[0].set_xlabel("Presence Ratio in "+posText)
    axes[1].set_xlabel("Presence Ratio in "+negText)

    # Annotate each bar in the first subplot
    for i, bar in enumerate(bars_pos):
        axes[0].text(presenceRatioPos[i], bar.get_y() + bar.get_height() / 2, f'{presenceRatioPos[i]:.2f}', va='center', ha='left')

    # Annotate each bar in the second subplot
    for i, bar in enumerate(bars_neg):
        axes[1].text(presenceRatioNeg[i], bar.get_y() + bar.get_height() / 2, f'{presenceRatioNeg[i]:.2f}', va='center', ha='left')


    axes[0].set_xlim(0,1.2)
    axes[1].set_xlim(0,1.2)
    axes[0].invert_xaxis()# Invert the x-axis of the first subplot

    axes[0].set(yticks=y, yticklabels=[])
    for yloc, selectedASVs in zip(y, featurenames):
        axes[0].annotate(selectedASVs, (0.5, yloc), xycoords=('figure fraction', 'data'),
                         ha='center', va='center', fontsize=9)
    fig.tight_layout(pad=2.0)
    plt.show()
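
The counting loops in `plotPresenseRatio` can also be written with NumPy boolean masks. The following is a minimal sketch, assuming the input already holds relative abundances (it skips the `FS.relative_abundance` call and all plotting); `presence_ratio` is a hypothetical helper.

```python
import numpy as np

def presence_ratio(X_rel, labels, pos_label, abundance_cutoff=0.01):
    """Per feature, the fraction of positive-class and of negative-class
    samples whose relative abundance exceeds `abundance_cutoff`."""
    X_rel = np.asarray(X_rel)
    is_pos = np.asarray(labels) == pos_label
    present = X_rel > abundance_cutoff          # samples x features, boolean
    ratio_pos = present[is_pos].mean(axis=0)    # mean of booleans = fraction present
    ratio_neg = present[~is_pos].mean(axis=0)
    return ratio_pos, ratio_neg

# Toy data: 4 samples (rows) x 2 features (columns)
X_toy = np.array([[0.20, 0.00],
                  [0.05, 0.02],
                  [0.00, 0.30],
                  [0.00, 0.005]])
labels_toy = ['LONG', 'LONG', 'SHORT', 'SHORT']

pos, neg = presence_ratio(X_toy, labels_toy, 'LONG')
print(pos, neg)
```

In this toy case the first feature is present in every LONG sample and in no SHORT sample, which is the pattern the bar plots are meant to reveal.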

### Compare the top 15 selected features by their presence ratio

In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

entries=15

for i, index in enumerate(selectedOTU_index_4tax):
    selectedOTU_index_15=index[:entries]
    #print(selectedOTU_index_15)
    selectedASVs_15=col_names_sig_sorted_4taxa[i][:entries]
    print(selectedASVs_15)
    X_FS_15=data_4taxa[i][:,selectedOTU_index_15]
    #df=pd.DataFrame(data=X_FS_15)
    plotPresenseRatio(X_FS_15,targetLabel,selectedASVs_15,posLabel="LONG",posText="Long",negText="Short",entries=entries)


In [None]:
##### check the plot results (using phylum as an example)
Phy_select_index = selectedOTU_index_4tax[3]
Phy_select_index_5 = Phy_select_index[0:5]
Phy_select_index_5

In [None]:
Phy_select_label_5 = [col_names_4taxa[3][i] for i in Phy_select_index_5]
print(Phy_select_label_5)
print(col_names_sig_sorted_4taxa[3][0:5])
print(weights_sig_sorted_4taxa[3][0:5])

In [None]:
X_phylum = data_4taxa[3][:,Phy_select_index_5]
#X_phylum = np.where(X_phylum > 0.01, 1, 0)

In [None]:
# test 
data_phy_test=FS.relative_abundance(data_4taxa[3])
FS.OTU_H_Score_arr(data_phy_test[:,selectedOTU_index_4tax[3][0:5]],targetLabel,cutOff=0.01)

In [None]:
# Recompute the H score one feature at a time for the top 5 phylum features
for i in Phy_select_index_5:
    print(
    FS.OTU_H_Score(data_4taxa[3][:,i],targetLabel,cutOff=0.01)
    )

In [None]:
print(weights_sig_sorted_4taxa[3])
print(weights_4taxa[3][selectedOTU_index_4tax[3]])

### Negative Gini Impurity
Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset. It’s calculated as:

$G = 1- \sum_{i=1}^C p_i^2$

where $C$ is the number of classes and $p_i$ is the proportion belonging to class $i$, so the measure also applies to classification with more than two classes.

Here I will use the negative Gini impurity (NG) to score each OTU: if NG is large (close to 1), the OTU exists in only one class; if NG is small (close to $1/C$), the OTU is evenly distributed among the classes.

$NG = \sum_{i=1}^C p_i^2$
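
A small worked example of the NG formula, assuming $p_i$ is the share of an OTU's presences that fall in class $i$. The implementation of `metric.Neg_GINI` is not shown in this notebook, so `negative_gini_sketch` is a hypothetical illustration and may differ from it.

```python
import numpy as np

def negative_gini_sketch(presence, labels):
    """NG = sum_i p_i^2, where p_i is the share of presences found in class i."""
    labels = np.asarray(labels)
    presence = np.asarray(presence, dtype=bool)
    classes = np.unique(labels)
    # Number of samples per class in which the OTU is present
    counts = np.array([presence[labels == c].sum() for c in classes], dtype=float)
    total = counts.sum()
    if total == 0:
        return float('nan')  # OTU never present: NG is undefined in this sketch
    p = counts / total
    return float(np.sum(p ** 2))

labels_toy = ['LONG', 'LONG', 'SHORT', 'SHORT']
ng_one_class = negative_gini_sketch([1, 1, 0, 0], labels_toy)  # present only in LONG
ng_even = negative_gini_sketch([1, 0, 1, 0], labels_toy)       # split evenly
print(ng_one_class, ng_even)  # 1.0 0.5
```

With two classes the values range from $1/2$ (evenly split) to 1 (one class only). With unbalanced classes, one might first divide each count by its class size so the majority class does not dominate; that variant is not shown here.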

In [None]:
np.unique(y, return_counts=True)

In [None]:
# NG for selected OTU
NG_4tax = []
for i, data  in enumerate(data_4taxa):
    X_FS = data[:,selectedOTU_index_4tax[i]]
    X_lasso = data[:,xind_lasso_4taxa[i]]
    NG_selected = metric.Neg_GINI(X_FS,y,cutOff=0.01)
    NG_Lasso = metric.Neg_GINI(X_lasso,y,cutOff=0.01)
    print(NG_selected.shape)
    print(NG_Lasso.shape)
    NG_4tax.append([NG_selected,NG_Lasso])

In [None]:
# Compare the NG of features selected by SelectMicro with those selected by Lasso
# Number of subplots
num_plots = len(data_4taxa)

# Create a figure with a grid of subplots
plt.figure(figsize=(4, 4 * num_plots))

# Loop through each index and create a subplot
for i in range(num_plots):
    plt.subplot(num_plots, 1, i + 1)  # (nrows, ncols, index)
    plt.boxplot([NG_4tax[i][0], NG_4tax[i][1]], tick_labels=['SelectMicro', 'Lasso'])
    plt.title(f'NG results of the selected OTU by SelectMicro vs. Lasso - {taxlabels[i]}')
    plt.ylabel('NG')
    plt.grid(axis='y')
# Adjust layout
plt.tight_layout()  # Adjusts the subplots to fit into the figure area.
plt.show()  # Show all plots at once

#### Analysis of the top features

In [None]:
for i, label in enumerate(col_names_sig_sorted_4taxa):
    print(taxlabels[i])
    print(label)