# Crush Rig Serosal Thickness T-Test Statistics

Written by Matt MacDonald for CIGITI at the Hospital for Sick Children Toronto

Serosal thickness measurements are taken from the histology data after the crushing of the tissue. Control measurements are from a non-crush location and crush measurements are at the location of the crush marked by blue ink. 

To determine descriptive statistics, these measurements must be grouped and tested for statistically significant differences in mean or variation. The control and crush measurements are in separate columns of the csv file. Although each row holds one control and one crush value, this layout mainly reflects how the data was organized; only the paired t-test for grouping 4 makes use of it.
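
As an illustration of the reshaping this layout calls for, the sketch below converts a small wide table into long format with `DataFrame.melt`. The column names (`crush_um`, `control_um`) and values are made up for the example and are not those of `SEROSA.csv`; the notebook itself reaches the same shape with two column slices and `pd.concat`.

```python
import pandas as pd

# Toy wide-format table; column names and values are hypothetical.
wide = pd.DataFrame({
    'patient': ['P1', 'P1', 'P2'],
    'load': [200, 400, 200],
    'crush_um': [40.0, 35.0, 50.0],
    'control_um': [60.0, 61.0, 70.0],
})

# Melt the two measurement columns into one 'thickness' column plus a label.
long = wide.melt(id_vars=['patient', 'load'],
                 value_vars=['crush_um', 'control_um'],
                 var_name='site', value_name='thickness')

# Replace the text label with a boolean flag, as the notebook does.
long['crush'] = long['site'] == 'crush_um'
long = long.drop(columns='site')
print(long)
```

Either route gives one row per measurement, which is the form the `groupby` calls in the following sections expect.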

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.rcParams['figure.dpi'] = 150

In [None]:
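import numpy as np
import pandas as pd

# Project-local helpers; PATH is expected to come from these star imports
from crush_read import *
from crush_plot import *
PATH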

In [None]:
ls $PATH

In [None]:
df = pd.read_csv(PATH / 'SEROSA.csv', na_values=['M'])  # missing values = M
df.shape

In [None]:
df.isna().sum()

In [None]:
df = df.dropna()
df.shape

In [None]:
df.dtypes

In [None]:
df.head()

In [None]:
crush = df.iloc[:, :4].copy()
crush.columns = ['patient', 'tissue', 'load', 'thickness']
crush['crush'] = True
crush.head()

In [None]:
control = df.iloc[:, [0, 1, 2, 4]].copy()
control.columns = ['patient', 'tissue', 'load', 'thickness']
control['crush'] = False
control.head()

In [None]:
data = pd.concat([crush, control])
data = data.reset_index(drop=True)
data.head()

In [None]:
data.tail()

# Statistical Groups
Create the groups that the t-tests will compare.

### 1. No grouping

In [None]:
cont = data[data.crush == False][['thickness']]
crush = data[data.crush == True][['thickness']]
stats = pd.concat([cont.describe(), crush.describe()], axis=1)
stats.columns = ['control', 'crush']
stats

In [None]:
stats.loc[['mean', 'std'], :].plot(kind='bar')
plt.title('Serosal thickness overall');

### 2. Group by patient

In [None]:
pat_cont = data[data.crush == False][['patient','thickness']].groupby('patient').describe().rename(columns={'thickness': 'control'})
pat_crush = data[data.crush == True][['patient','thickness']].groupby('patient').describe().rename(columns={'thickness': 'crush'})
stats = pd.concat([pat_cont, pat_crush], axis=1, levels=['control', 'crush'])
stats

In [None]:
stats.loc[:, (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
plt.title('Serosal thickness by patient');

The patients that show a lot of variation in serosal thickness show it in both the control and crush groups. This indicates consistent variation between patients, but it is felt that this has more to do with the histology preparation than with a true anatomical difference between patients.

In [None]:
def ratio(v):
    r = v.max() / v.min()
    return round(float(r), 1), r < 2 

for i in stats.index:
    print(ratio(stats.loc[i, (['control', 'crush'], ['std'])].values))

Most of the individual patient groups show a ratio of standard deviations between control and crush measurements of less than two. By that common rule of thumb, it is reasonable to assume that the variances are equal when conducting t-tests.
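
The ratio check can be sketched in isolation with the standard library. This is only an illustration on made-up thickness values; `sd_ratio` and its `limit` parameter are hypothetical names, and the threshold of two is the rule of thumb used in this notebook rather than a formal test.

```python
from statistics import stdev

def sd_ratio(control, crush, limit=2.0):
    """Return (larger SD / smaller SD, whether the ratio is under the limit)."""
    s_control, s_crush = stdev(control), stdev(crush)
    r = max(s_control, s_crush) / min(s_control, s_crush)
    return round(r, 1), r < limit

# Made-up thickness values in micrometres, for illustration only.
control = [60.0, 62.0, 58.0, 61.0, 59.0]
crush = [40.0, 44.0, 37.0, 42.0, 38.0]
print(sd_ratio(control, crush))
```

A ratio under the limit suggests a pooled-variance t-test is defensible for that group; a larger ratio points towards the unequal-variance form.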

### 3. Group by load level

In [None]:
load_cont = data[(data.crush == False)][['load','thickness']].groupby('load').describe().rename(columns={'thickness': 'control'})
load_crush = data[(data.crush == True)][['load','thickness']].groupby('load').describe().rename(columns={'thickness': 'crush'})
stats = pd.concat([load_cont, load_crush], axis=1, levels=['control', 'crush'])
stats

In [None]:
stats.loc[:, (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
plt.title('Serosal thickness by load level');

In [None]:
for i in stats.index:
    print(ratio(stats.loc[i, (['control', 'crush'], ['std'])].values))

When grouping by load level and ignoring which patient the measurements come from, the trend is still very clear. Note, however, that the standard deviation is much higher because of patient-to-patient variation. The equal variances assumption holds well for this grouping.

### 4. Group by patient and load level

In [None]:
pat_load_cont = data[(data.crush == False)][['patient', 'load', 'thickness']].groupby(['patient', 'load']).describe().rename(columns={'thickness': 'control'})
pat_load_crush = data[(data.crush == True)][['patient', 'load', 'thickness']].groupby(['patient', 'load']).describe().rename(columns={'thickness': 'crush'})
stats = pd.concat([pat_load_cont, pat_load_crush], axis=1, levels=['control', 'crush'])
stats

With this many groups it is not possible to plot in a cohesive way. Let's look at specific patients and load levels.

In [None]:
p = np.random.choice(stats.index.levels[0])
stats.loc[(p), (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
plt.title(f'Serosal thickness by load level for patient {p}');

In [None]:
load_level = 200
stats.xs(load_level, level=1).loc[:, (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
plt.title(f'Serosal thickness by patient at {load_level}g');

In [None]:
load_level = 1200
stats.xs(load_level, level=1).loc[:, (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
plt.title(f'Serosal thickness by patient at {load_level}g');

In [None]:
ratios = []
for i in stats.index:
    ratios.append(ratio(stats.loc[i, (['control', 'crush'], ['std'])].values))
ratios = np.array(ratios)
print(f"{100 * ratios.sum(axis=0)[1] / ratios.shape[0]:.2f}% of standard deviation ratios are < 2 for the {ratios.shape[0]} groups")

When grouping by both patient and load level, the trend is still very clear that the mean thickness is dependent on load level. Overall standard deviations are much lower because we are not comparing over different patients. However, the equal variances assumption is not valid for about a quarter of the sets with this grouping, so it is advisable not to assume equal variances.
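
Dropping the equal variance assumption corresponds to Welch's t-test, which is what `ttest_ind` computes when `equal_var=False`. The sketch below works out the Welch statistic and its Welch-Satterthwaite degrees of freedom by hand on made-up values; `welch_t` is a hypothetical helper for illustration, not part of the analysis code.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error contributed by each sample.
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up values (micrometres) with clearly different spreads.
control = [60.0, 62.0, 58.0, 61.0, 59.0]
crush = [40.0, 50.0, 30.0, 45.0, 35.0]
t, df = welch_t(control, crush)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The degrees of freedom fall below the pooled value of `n1 + n2 - 2` when the spreads differ, which makes the test more conservative for such groups.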

### 5. Group by tissue type

In [None]:
tiss_cont = data[(data.crush == False)][['tissue','thickness']].groupby('tissue').describe().rename(columns={'thickness': 'control'})
tiss_crush = data[(data.crush == True)][['tissue','thickness']].groupby('tissue').describe().rename(columns={'thickness': 'crush'})
stats = pd.concat([tiss_cont, tiss_crush], axis=1, levels=['control', 'crush'])
stats

In [None]:
stats.loc[:, (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
plt.title('Serosal thickness by tissue type');

In [None]:
for i in stats.index:
    print(ratio(stats.loc[i, (['control', 'crush'], ['std'])].values))

When grouping by tissue type and ignoring load level and patient, the trend of serosal thinning after crush is still very clear. Nonetheless, this comparison is not particularly useful except to show that the small bowel serosal layer tends to be twice as thick as that of the colon.

Note that grouping by patient and tissue is the same as grouping by patient since only one tissue type was measured for each patient, so it is excluded.

### 6. Group by tissue type and load level

In [None]:
tiss_load_cont = data[(data.crush == False)][['tissue', 'load', 'thickness']].groupby(['tissue', 'load']).describe().rename(columns={'thickness': 'control'})
tiss_load_crush = data[(data.crush == True)][['tissue', 'load', 'thickness']].groupby(['tissue', 'load']).describe().rename(columns={'thickness': 'crush'})
stats = pd.concat([tiss_load_cont, tiss_load_crush], axis=1, levels=['control', 'crush'])
stats

In [None]:
stats.loc[:, (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
plt.title('Serosal thickness by tissue type and load level');

In [None]:
ratios = []
for i in stats.index:
    ratios.append(ratio(stats.loc[i, (['control', 'crush'], ['std'])].values))
ratios = np.array(ratios)
print(f"{100 * ratios.sum(axis=0)[1] / ratios.shape[0]:.2f}% of standard deviation ratios are < 2 for the {ratios.shape[0]} groups")

When grouping by both tissue type and load level, the trend is still very clear that the mean thickness is dependent on load level. By eye, the transition to significant differences appears to occur at a similar threshold for the two tissue types. Standard deviations are higher due to averaging over patients. However, the equal variances assumption is not valid for this grouping either.

# T-TESTS

Groupings 3, 4, and 6 are the most valuable. Any grouping that averages across load levels will not be valuable to use because we are certain that the load level is the key parameter, which can be clearly seen in the plots.

For the comparison of the control and crush serosal thickness measurements, a one-tailed t-test will be used since it is known that the serosa thins due to crush and never thickens. The t-tests are independent; however, a paired (relative) t-test will also be done for grouping 4 since there is patient-specific pairing of measurements.
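
A point worth making explicit about the one-tailed conversion: halving a two-sided p-value is only correct when the observed effect lies in the hypothesised direction. The helper below is a minimal sketch of that rule, assuming the t statistic was computed as control minus crush; `one_tailed_p` is a hypothetical name. Recent SciPy releases also accept an `alternative` argument on `ttest_ind` and `ttest_rel` that handles this directly.

```python
def one_tailed_p(t_stat, p_two_sided):
    """One-tailed p-value for the alternative 'control mean > crush mean'.

    Assumes t_stat was computed as (control - crush), so thinning gives t > 0.
    """
    half = p_two_sided / 2
    return half if t_stat > 0 else 1 - half

print(one_tailed_p(2.1, 0.06))   # effect in the expected direction
print(one_tailed_p(-2.1, 0.06))  # effect in the opposite direction
```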

The results will show the patient, tissue, and/or load level as needed, along with the t score and p score statistics. P scores below 0.05 are considered significant. The average absolute and percent deformation deltas (from control to crush) for each group will also be output for use in modelling.
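
The deformation deltas can be sketched on a toy table as follows. The values are made up, and the string `site` column is a hypothetical stand-in for the notebook's boolean `crush` flag; the definitions (crush mean minus control mean, and that difference as a percentage of the control mean) follow the description above.

```python
import pandas as pd

# Made-up long-format measurements (micrometres); not the study data.
toy = pd.DataFrame({
    'load': [200, 200, 200, 200, 1200, 1200, 1200, 1200],
    'site': ['control', 'control', 'crush', 'crush'] * 2,
    'thickness': [60.0, 64.0, 58.0, 60.0, 61.0, 63.0, 30.0, 32.0],
})

# Mean thickness per load level, with control and crush side by side.
means = toy.groupby(['load', 'site'])['thickness'].mean().unstack('site')

# Absolute and percent change from control to crush for each load level.
deltas = pd.DataFrame({
    'Absolute Delta (um)': means['crush'] - means['control'],
    'Percent Delta': 100 * (means['crush'] - means['control']) / means['control'],
})
print(deltas)
```

Negative deltas indicate thinning relative to the control site.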

In [None]:
from scipy.stats import ttest_ind, ttest_rel

### 3. Group by load level
Assumption of equal variance is valid.

In [None]:
tstats = np.full((data.load.nunique(), 4), np.nan)  # one row per load level
tstats[:, 0] = data.load.unique()
tstats

In [None]:
load_cont = data[(data.crush == False)][['load','thickness']]
load_crush = data[(data.crush == True)][['load','thickness']]

print(f"Control measurements: {load_cont.shape[0]}")
print(f"Crush measurements: {load_crush.shape[0]}")

In [None]:
for i in range(tstats.shape[0]):
    load = tstats[i, 0]
    ttest, pval = ttest_ind(load_cont[load_cont.load == load].thickness,
                            load_crush[load_crush.load == load].thickness,
                            equal_var=True)
    tstats[i, 1] = ttest
    tstats[i, 2] = pval / 2 if ttest > 0 else 1 - pval / 2  # one-tailed (control > crush)
    tstats[i, 3] = tstats[i, 2] < 0.05

tstats = pd.DataFrame(tstats, columns=['Load (g)', 'T Score', 'P Score', 'Significant'])

Add in the deformation delta values.

In [None]:
tstats['Absolute Delta (um)'] = np.nan
tstats['Percent Delta'] = np.nan

for i in tstats.index:
    load = tstats.loc[i, 'Load (g)']
    initial = load_cont[load_cont.load == load].thickness.mean()
    delta = load_crush[load_crush.load == load].thickness.mean() - initial
    tstats.loc[i, 'Absolute Delta (um)'] = delta
    tstats.loc[i, 'Percent Delta'] = 100 * delta / initial

In [None]:
tstats

In [None]:
tstats.to_csv('ttests_load.csv')

In [None]:
tstats = tstats.set_index('Load (g)')
tstats[['T Score', 'P Score']].plot();

In [None]:
tstats[['Absolute Delta (um)', 'Percent Delta']].plot();

### 4. Group by patient and load level
Assumption of equal variance is NOT valid.

In [None]:
cont = data[(data.crush == False)][['patient','tissue','load','thickness']]
crush = data[(data.crush == True)][['patient','tissue','load','thickness']]

print(f"Control measurements: {cont.shape[0]}")
print(f"Crush measurements: {crush.shape[0]}")

In [None]:
patient_opts = data.patient.unique()
load_opts = data.load.unique()

In [None]:
tstats = pd.DataFrame(columns=['Patient Code', 'Tissue', 'Load (g)', 'T Score', 'P Score', 'Significant'])

In [None]:
tstats_list = []
tstats_rel_list = []
for i in range(patient_opts.shape[0]):
    patient = patient_opts[i]
    for j in range(load_opts.shape[0]):
        load = load_opts[j]
        
        # Add row to dataframe
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        tstats = pd.concat([tstats, pd.DataFrame([{
            'Patient Code': patient,
            'Tissue': data[data.patient == patient].tissue.unique()[0],
            'Load (g)': load}])], ignore_index=True)
        
        # Do independent t-test and store
        ttest, pval = ttest_ind(cont[(cont.patient == patient) & (cont.load == load)].thickness,
                                crush[(crush.patient == patient) & (crush.load == load)].thickness,
                                equal_var=False)
        pval = pval / 2 if ttest > 0 else 1 - pval / 2  # one-tailed (control > crush)
        tstats_list.append((ttest, pval, pval < 0.05))
        
        # Do relative t-test and store
        ttest, pval = ttest_rel(cont[(cont.patient == patient) & (cont.load == load)].thickness,
                                crush[(crush.patient == patient) & (crush.load == load)].thickness)
        pval = pval / 2 if ttest > 0 else 1 - pval / 2  # one-tailed (control > crush)
        tstats_rel_list.append((ttest, pval, pval < 0.05))

In [None]:
# Add statistics to dataframe
tstats_list = np.array(tstats_list)
tstats_rel_list = np.array(tstats_rel_list)
tstats_rel = tstats.copy()  # relative t-test dataframe

for i in range(tstats.shape[0]):
    tstats.iloc[i, 3:] = tstats_list[i, :]
    tstats_rel.iloc[i, 3:] = tstats_rel_list[i, :]

# Remove missing combinations
tstats = tstats.dropna()
tstats_rel = tstats_rel.dropna()

Add in the deformation delta values.

In [None]:
def add_deformation(tstats):
    tstats = tstats.copy()
    tstats['Absolute Delta (um)'] = np.nan
    tstats['Percent Delta'] = np.nan

    for i in tstats.index:
        patient = tstats.loc[i, 'Patient Code']
        load = tstats.loc[i, 'Load (g)']
        initial = cont[(cont.patient == patient) & (cont.load == load)].thickness.mean()
        delta = crush[(crush.patient == patient) & (crush.load == load)].thickness.mean() - initial
        tstats.loc[i, 'Absolute Delta (um)'] = delta
        tstats.loc[i, 'Percent Delta'] = 100 * delta / initial
    return tstats

In [None]:
tstats = add_deformation(tstats)
tstats_rel = add_deformation(tstats_rel)

In [None]:
tstats

Compare with the paired (relative) t-test results.

In [None]:
tstats_rel

In [None]:
print("Number of significant t-tests:")
print(f"Independent = {tstats['Significant'].sum()}")
print(f"Relative = {tstats_rel['Significant'].sum()}")

In [None]:
print(f"Disagreements = {(tstats['Significant'] - tstats_rel['Significant']).abs().sum()}")

In [None]:
mask = tstats['Significant'] != tstats_rel['Significant']
tstats[mask]

In [None]:
tstats_rel[mask]

Very few disagreements between the two types of t-tests are present. The following plots show the patients where there is disagreement.

In [None]:
pat_load_cont = data[(data.crush == False)][['patient', 'load', 'thickness']].groupby(['patient', 'load']).describe().rename(columns={'thickness': 'control'})
pat_load_crush = data[(data.crush == True)][['patient', 'load', 'thickness']].groupby(['patient', 'load']).describe().rename(columns={'thickness': 'crush'})
stats = pd.concat([pat_load_cont, pat_load_crush], axis=1, levels=['control', 'crush'])

In [None]:
for p in tstats[mask]['Patient Code']:
    stats.loc[(p), (['control', 'crush'], ['mean', 'std'])].plot(kind='bar')
    plt.title(f'Serosal thickness by load level for patient {p}');

On inspection, both results are reasonable and don't depend on tissue type, but per best practices in statistics the paired (relative) t-test will be used. Since each set of measurements (crush and control) comes from the same patient histological slide for this grouping, the paired t-test is most appropriate. The independent t-test runs the risk of being overly confident in identifying significance.
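
For reference, the paired test works on the per-pair differences rather than on the two samples separately. The sketch below computes the paired t statistic by hand on made-up values; `paired_t` is a hypothetical helper and assumes both lists have the same length and matching order, which is also what `ttest_rel` requires.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(control, crush):
    """Paired t statistic: mean of the differences over its standard error."""
    diffs = [c - x for c, x in zip(control, crush)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Made-up paired thickness values (micrometres), one pair per position.
control = [50.0, 60.0, 70.0, 80.0]
crush = [45.0, 56.0, 64.0, 75.0]
print(f"paired t = {paired_t(control, crush):.2f}")
```

Because only the differences enter the statistic, variation shared by both members of a pair does not affect the result.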

In [None]:
tstats_rel.to_csv('ttests_all.csv')

In [None]:
tstats_table = tstats_rel.groupby(['Patient Code', 'Tissue', 'Load (g)'], sort=False).max()
tstats_table

### 6. Group by tissue type and load level
Assumption of equal variance is NOT valid.

In [None]:
cont = data[(data.crush == False)][['patient','tissue','load','thickness']]
crush = data[(data.crush == True)][['patient','tissue','load','thickness']]

print(f"Control measurements: {cont.shape[0]}")
print(f"Crush measurements: {crush.shape[0]}")

In [None]:
tissue_opts = data.tissue.unique()
load_opts = data.load.unique()

In [None]:
tstats = pd.DataFrame(columns=['Tissue', 'Load (g)', 'T Score', 'P Score', 'Significant'])

In [None]:
tstats_list = []
for i in range(tissue_opts.shape[0]):
    tissue = tissue_opts[i]
    for j in range(load_opts.shape[0]):
        load = load_opts[j]
        
        # Add row to dataframe
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        tstats = pd.concat([tstats, pd.DataFrame([{'Tissue': tissue, 'Load (g)': load}])],
                           ignore_index=True)
        
        # Do independent t-test and store
        ttest, pval = ttest_ind(cont[(cont.tissue == tissue) & (cont.load == load)].thickness,
                                crush[(crush.tissue == tissue) & (crush.load == load)].thickness,
                                equal_var=False)
        pval = pval / 2 if ttest > 0 else 1 - pval / 2  # one-tailed (control > crush)
        tstats_list.append((ttest, pval, pval < 0.05))

In [None]:
# Add statistics to dataframe
tstats_list = np.array(tstats_list)
for i in range(tstats.shape[0]):
    tstats.iloc[i, 2:] = tstats_list[i, :]

# Remove missing combinations
tstats = tstats.dropna()

In [None]:
tstats

Add in the deformation delta values.

In [None]:
tstats['Absolute Delta (um)'] = np.nan
tstats['Percent Delta'] = np.nan

for i in tstats.index:
    tissue = tstats.loc[i, 'Tissue']
    load = tstats.loc[i, 'Load (g)']
    initial = cont[(cont.tissue == tissue) & (cont.load == load)].thickness.mean()
    delta = crush[(crush.tissue == tissue) & (crush.load == load)].thickness.mean() - initial
    tstats.loc[i, 'Absolute Delta (um)'] = delta
    tstats.loc[i, 'Percent Delta'] = 100 * delta / initial

In [None]:
tstats.to_csv('ttests_tissue.csv')

In [None]:
tstats = tstats.set_index('Load (g)')
for tissue in tissue_opts:
    tstats[tstats['Tissue'] == tissue][['T Score', 'P Score']].plot()
    plt.title(tissue)

In [None]:
tstats

In [None]:
for tissue in tissue_opts:
    tstats[tstats['Tissue'] == tissue][['Absolute Delta (um)', 'Percent Delta']].plot()
    plt.title(tissue)