Code to analyze beta diversity

In [1]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import pandas as pd
import qiime2 as q2
from skbio import OrdinationResults
from qiime2 import Visualization
from seaborn import scatterplot

%matplotlib inline

In [40]:
#all variables
Data_raw='Data/raw'
Data_classified='Data/classified'
Data_diversity='Data/diversity'

<div style="background-color: skyblue; padding: 10px;">
    Titles
    </div>
<div style="background-color: aliceblue; padding: 10px;">
    Results



<div style="background-color: skyblue; padding: 10px;">

## Creating the necessary files  
**Creating files on overall data**  
The diversity core metrics could be computed on JupyterHub.

Adjusting merged_output.tsv so that it can be analyzed without problems (no spaces or slashes in column names)

In [44]:
df = pd.read_csv(f'{Data_raw}/merged_output.tsv', sep='\t')
# Keep first column name, modify the rest
new_columns = [df.columns[0]] + [col.replace(' ', '_').replace('/', '_') for col in df.columns[1:]]
df.columns = new_columns
df.to_csv(f'{Data_raw}/merged_output_usable.tsv', sep='\t', index=False)
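
The renaming rule can be checked on a few headers before touching the real file. The sketch below repeats the list comprehension from the cell above in a small helper; the name `sanitize_columns` and the example headers are hypothetical and not taken from the dataset.

```python
def sanitize_columns(columns):
    """Keep the first column name (the sample ID header) unchanged and
    replace spaces and slashes with underscores in all other names."""
    if not columns:
        return []
    rest = [col.replace(" ", "_").replace("/", "_") for col in columns[1:]]
    return [columns[0]] + rest


# Hypothetical headers, for illustration only
example = ["sample id", "start time", "pH/day 7", "plate"]
cleaned = sanitize_columns(example)
print(cleaned)
```

The first header is left alone on purpose, matching the cell above, so the ID column keeps whatever name the metadata file already uses.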

Creating a metadata file that contains only the people who filled out the survey

In [79]:
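df = pd.read_csv(f"{Data_raw}/merged_output_usable.tsv", sep="\t")

# Rows missing either survey timestamp are treated as "did not fill out the survey"
columns_to_check = ["start_time", "completion_time"]

df_filtered = df.dropna(subset=columns_to_check)

# Make sure the output folder exists before writing
os.makedirs(f"{Data_diversity}/filtered-metadata", exist_ok=True)
df_filtered.to_csv(f"{Data_diversity}/filtered-metadata/meta_survey.tsv", sep="\t", index=False)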

**Creating the files on all metadata**

In [9]:
! qiime diversity core-metrics \
  --i-table $Data_classified/table-filtered.qza \
  --m-metadata-file $Data_raw/merged_output.tsv \
  --p-sampling-depth 3000 \
  --output-dir $Data_diversity/core-metrics-results-merged

Saved FeatureTable[Frequency] to: Data/diversity/core-metrics-results-merged/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged/evenness_vector.qza
Saved DistanceMatrix to: Data/diversity/core-metrics-results-merged/jaccard_distance_matrix.qza
Saved DistanceMatrix to: Data/diversity/core-metrics-results-merged/bray_curtis_distance_matrix.qza
Saved PCoAResults to: Data/diversity/core-metrics-results-merged/jaccard_pcoa_results.qza
Saved PCoAResults to: Data/diversity/core-metrics-results-merged/bray_curtis_pcoa_results.qza
Saved Visualization to: Data/diversity/core-metrics-results-merged/jaccard_emperor.qzv
Saved Visualization

The following script was then submitted as a job on Euler, because JupyterHub did not have enough memory

In [None]:
#!/bin/bash
#SBATCH --job-name=kmerizer
#SBATCH --time=04:00:00
#SBATCH --mem-per-cpu=32GB
#SBATCH --cpus-per-task=4
#SBATCH --output=kmerizer.log
source /cluster/home/nschwager/miniconda3/etc/profile.d/conda.sh
source ~/.bashrc
conda activate qiime2-amplicon-2025.10
qiime kmerizer core-metrics \
  --i-table /cluster/home/nschwager/kmerizer/table-filtered.qza \
  --i-sequences /cluster/home/nschwager/kmerizer/rep-seqs-filtered.qza \
  --m-metadata-file /cluster/home/nschwager/kmerizer/merged_output.tsv \
  --p-sampling-depth 3000 \
  --output-dir /cluster/scratch/home/kmerizer/kmerizer-results-merged
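
Both core-metrics calls rarefy to a sampling depth of 3000, so samples with fewer reads than that are dropped from the results. A minimal sketch of that retention check, assuming the per-sample read totals are already available as a dictionary (the sample IDs and counts below are made up):

```python
def retained_samples(read_totals, sampling_depth):
    """Return the IDs of samples with at least `sampling_depth` reads,
    i.e. the samples that survive rarefaction at that depth."""
    return sorted(sid for sid, total in read_totals.items() if total >= sampling_depth)


# Made-up read totals per sample
read_totals = {"S1": 12500, "S2": 2999, "S3": 3000, "S4": 450}
kept = retained_samples(read_totals, sampling_depth=3000)
print(f"{len(kept)} of {len(read_totals)} samples retained: {kept}")
```

Running such a check for a few candidate depths shows how many samples each choice would cost.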

**Creating files for sourdough data only**  
First, filtering the corresponding tables

In [35]:
!qiime feature-table filter-samples \
  --i-table $Data_classified/table-filtered.qza \
  --m-metadata-file $Data_raw/merged_output.tsv  \
  --p-where "sample_type='sourdough'" \
  --o-filtered-table $Data_classified/table-filtered-sourdough_only.qza

Saved FeatureTable[Frequency] to: Data/classified/table-filtered-sourdough_only.qza

In [40]:
!qiime feature-table filter-seqs \
  --i-data $Data_classified/rep-seqs-filtered.qza \
  --i-table $Data_classified/table-filtered-sourdough_only.qza \
  --o-filtered-data $Data_classified/rep-seqs-filtered-sourdough_only.qza

Saved FeatureData[Sequence] to: Data/classified/rep-seqs-filtered-sourdough_only.qza

Creating the metrics files

In [37]:
! qiime diversity core-metrics \
  --i-table $Data_classified/table-filtered-sourdough_only.qza \
  --m-metadata-file $Data_raw/merged_output.tsv \
  --p-sampling-depth 3000 \
  --output-dir $Data_diversity/core-metrics-results-merged-sourdough-only

Saved FeatureTable[Frequency] to: Data/diversity/core-metrics-results-merged-sourdough-only/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged-sourdough-only/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged-sourdough-only/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged-sourdough-only/evenness_vector.qza
Saved DistanceMatrix to: Data/diversity/core-metrics-results-merged-sourdough-only/jaccard_distance_matrix.qza
Saved DistanceMatrix to: Data/diversity/core-metrics-results-merged-sourdough-only/bray_curtis_distance_matrix.qza
Saved PCoAResults to: Data/diversity/core-metrics-results-merged-sourdough-only/jaccard_pcoa_results.qza
Saved PCoAResults to: Data/diversity/core-metrics-results-merged-sourdough-only/bray_curtis_pcoa_results.qza


In [42]:
!qiime kmerizer core-metrics \
  --i-table $Data_classified/table-filtered-sourdough_only.qza \
  --i-sequences $Data_classified/rep-seqs-filtered-sourdough_only.qza \
  --m-metadata-file $Data_raw/merged_output.tsv \
  --p-sampling-depth 3000 \
  --output-dir $Data_diversity/kmerizer-results-merged-sourdough-only

Saved FeatureTable[Frequency] to: Data/diversity/kmerizer-results-merged-sourdough-only/rarefied_table.qza
Saved FeatureTable[Frequency] to: Data/diversity/kmerizer-results-merged-sourdough-only/kmer_table.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/kmerizer-results-merged-sourdough-only/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/kmerizer-results-merged-sourdough-only/shannon_vector.qza
Saved DistanceMatrix to: Data/diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix.qza
Saved DistanceMatrix to: Data/diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix.qza
Saved PCoAResults to: Data/diversity/kmerizer-results-merged-sourdough-only/jaccard_pcoa_results.qza
Saved PCoAResults to: Data/diversity/kmerizer-results-merged-sourdough-only/bray_curtis_pcoa_results.qza
Saved Visualization to: Data/diversi

<div style="background-color: skyblue; padding: 10px;">

## Analysis of whole Metadata ITS

In [4]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged/scatterplot.qzv")

<div style="background-color: aliceblue; padding: 10px;">
    
- Hand swabs and sourdough communities show different sets of fungi and different relative abundances  
- With day on the y-axis and Bray-Curtis on the x-axis, the sourdoughs seem to become more and more similar to the hands over time  
- There appears to be no difference between right and left hand

In [13]:
Visualization.load(f"{Data_diversity}/core-metrics-results-merged/bray_curtis_emperor.qzv")

**Comparison of the hand & sourdough environment**  

Bray-Curtis

In [15]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --m-metadata-column sample_type \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results-merged/bray_curtis-sample_type-significance.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged/bray_curtis-sample_type-significance.qzv

In [3]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged/bray_curtis-sample_type-significance.qzv")

Jaccard

In [19]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --m-metadata-column sample_type \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results-merged/jaccard-sample_type-significance.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged/jaccard-sample_type-significance.qzv

In [4]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged/jaccard-sample_type-significance.qzv")

<div style="background-color: aliceblue; padding: 10px;">

- Comparing the sample types (hand swabs vs. sourdough) shows a significant difference in both abundance and composition (Bray-Curtis: p-value 0.001, q-value 0.0025, pseudo-F 245.525768; Jaccard: p-value 0.001, q-value 0.001429, pseudo-F 102.962864)
- The significant Jaccard result indicates that a high proportion of features is not shared between hand and sourdough overall
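
To make the difference between the two metrics concrete: Bray-Curtis is weighted by abundance, whereas Jaccard only looks at presence or absence. The sketch below uses the standard textbook definitions on made-up counts; it is an illustration only and not the code path that QIIME 2 runs.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def jaccard(u, v):
    """Jaccard distance on presence/absence: 1 - |shared| / |union|."""
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    union = present_u | present_v
    if not union:
        return 0.0
    return 1 - len(present_u & present_v) / len(union)


# Made-up counts for four features in two samples:
# the same two features are present, but with swapped abundances
hand = [90, 10, 0, 0]
sourdough = [10, 90, 0, 0]
print(bray_curtis(hand, sourdough))
print(jaccard(hand, sourdough))
```

In this toy case the two samples share exactly the same features, so Jaccard sees no difference, while Bray-Curtis picks up the shift in abundance.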

**Comparison of hand & sourdough**

In [None]:
!qiime longitudinal pairwise-distances \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --p-state-column day \
    --p-group-column sample_type \
    --p-state-1 7.0 \
    --p-state-2 21.0 \
    --p-individual-id-column person-id \
    --o-visualization $Data_diversity/kmerizer-results-merged/bray-curtis-hand-assimilation-over-time.qzv
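
Conceptually, a pairwise-distances analysis looks up, for each individual, the distance between the sample at state 1 and the sample at state 2, and then compares those distances between groups. The sketch below illustrates only that lookup step on made-up data. It is not the plugin's implementation, and the data structures (a dictionary of distances keyed by sample pairs, a dictionary of metadata tuples) are assumptions chosen for brevity.

```python
def paired_distances(distances, metadata, state_1, state_2):
    """For each (individual, group), return the distance between that
    individual's sample at state_1 and its sample at state_2.

    distances: dict mapping frozenset({sample_a, sample_b}) -> distance
    metadata:  dict mapping sample ID -> (individual, group, state)
    """
    # Index the sample IDs by (individual, group) and then by state
    by_subject = {}
    for sample_id, (individual, group, state) in metadata.items():
        by_subject.setdefault((individual, group), {})[state] = sample_id

    # Keep only subjects that have a sample at both states
    result = {}
    for key, states in by_subject.items():
        if state_1 in states and state_2 in states:
            pair = frozenset({states[state_1], states[state_2]})
            result[key] = distances[pair]
    return result


# Made-up samples: one person with sourdough and hand samples on day 7 and day 21
metadata = {
    "sd7": ("P1", "sourdough", 7.0),
    "sd21": ("P1", "sourdough", 21.0),
    "h7": ("P1", "hand", 7.0),
    "h21": ("P1", "hand", 21.0),
}
distances = {
    frozenset({"sd7", "sd21"}): 0.25,
    frozenset({"h7", "h21"}): 0.60,
    frozenset({"sd7", "h7"}): 0.90,
}
paired = paired_distances(distances, metadata, 7.0, 21.0)
print(paired)
```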

**Comparison of hand**  

Bray-Curtis

In [15]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column hand \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/bray_curtis-hand-significance.qzv

Visualization.load(f"{Data_diversity}/kmerizer-results/bray_curtis-hand-significance.qzv")

Saved Visualization to: Data/diversity/kmerizer-results/bray_curtis-hand-significance.qzv

Jaccard

In [16]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column hand \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/jaccard-hand-significance.qzv

Visualization.load(f"{Data_diversity}/kmerizer-results/jaccard-hand-significance.qzv")

Saved Visualization to: Data/diversity/kmerizer-results/jaccard-hand-significance.qzv

<div style="background-color: aliceblue; padding: 10px;">

- There is no significant difference in abundance or composition between right and left hand
- Bray-Curtis: p-value 0.766, q-value 0.766, pseudo-F 0.746564  
- Jaccard: p-value and q-value 0.297, pseudo-F 1.024582
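
For orientation, the pseudo-F statistic and its permutation p-value can be sketched in plain Python, assuming the tests above were run with PERMANOVA (the default method, as far as we know). The distance matrix and labels below are made up, and the function names are hypothetical; this is not the QIIME 2 or scikit-bio code.

```python
import random
from itertools import combinations


def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F for a symmetric distance matrix (list of lists)
    and a list of group labels, one per sample."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares: squared distances over all pairs, divided by N
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: the same quantity computed per group
    ss_within = 0.0
    for label in labels:
        idx = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, groups, permutations=999, seed=0):
    """Return (pseudo-F, permutation p-value) by shuffling the labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Made-up distances: small within a group, large between groups
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
groups = ["left", "left", "right", "right"]
f_stat, p_value = permanova(dist, groups)
print(f_stat, p_value)
```

A pseudo-F close to 1, as in the hand comparison above, means the between-group spread is about the same size as the within-group spread.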

<div style="background-color: skyblue; padding: 10px;">

## Analysis of the filtered, sourdough-only metadata

In [50]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/scatterplot.qzv")

<div style="background-color: aliceblue; padding: 10px;">

- The background seems to matter less and less over time (x-field: Bray-Curtis 1, y-field: day); it appears to explain most of the difference in composition at the beginning  
- If a null value for day-21 aromas means that there were no aromas, then lower pH is associated with more aromas

In [47]:
Visualization.load(f"{Data_diversity}/core-metrics-results-merged-sourdough-only/bray_curtis_emperor.qzv")

<div style="background-color: aliceblue; padding: 10px;">

- Strong difference between sterile and non-sterile background
- Some day-28 aromas seem to appear only on the sterile side or only on the non-sterile side

**Control: checking that the different plates have no effect**

Bray-Curtis

In [73]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --m-metadata-column plate \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/bray-curtis-plate.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/bray-curtis-plate.qzv

In [74]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/bray-curtis-plate.qzv")

Jaccard

In [75]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --m-metadata-column plate \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard-plate.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/jaccard-plate.qzv

In [76]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/jaccard-plate.qzv")

<div style="background-color: aliceblue; padding: 10px;">
    
- The plate has no influence on sourdough composition, as desired

**Effect of Background on Sourdough**

In [3]:
!qiime longitudinal pairwise-distances \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --p-state-column day \
    --p-group-column background \
    --p-state-1 7.0 \
    --p-state-2 21.0 \
    --p-individual-id-column person-id \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/bray-curtis-background-difference-over-time.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/bray-curtis-background-difference-over-time.qzv

In [4]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/bray-curtis-background-difference-over-time.qzv")

In [5]:
!qiime longitudinal pairwise-distances \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --p-state-column day \
    --p-group-column background \
    --p-state-1 7.0 \
    --p-state-2 21.0 \
    --p-individual-id-column person-id \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard-background-difference-over-time.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/jaccard-background-difference-over-time.qzv

In [6]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/jaccard-background-difference-over-time.qzv")

<div style="background-color: aliceblue; padding: 10px;">
    
Depending on the background, the change in sourdough fungal abundance over time differs highly significantly (significant Bray-Curtis), whereas the change in composition does not (non-significant Jaccard).

**Effect of latitude & longitude**

Bray-Curtis

In [98]:
!qiime diversity filter-distance-matrix \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_diversity/filtered-metadata/meta_survey.tsv  \
    --o-filtered-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix_survey.qza

Saved DistanceMatrix to: Data/diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix_survey.qza

In [99]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix_survey.qza \
    --m-metadata-file $Data_diversity/filtered-metadata/meta_survey.tsv  \
    --p-formula "latitude*longitude" \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_location.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/bray_curtis_location.qzv

In [100]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/bray_curtis_location.qzv")

Jaccard

In [101]:
!qiime diversity filter-distance-matrix \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_diversity/filtered-metadata/meta_survey.tsv  \
    --o-filtered-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix_survey.qza

Saved DistanceMatrix to: Data/diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix_survey.qza

In [102]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix_survey.qza \
    --m-metadata-file $Data_diversity/filtered-metadata/meta_survey.tsv  \
    --p-formula "latitude*longitude" \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_location.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/jaccard_location.qzv

In [103]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/jaccard_location.qzv")

<div style="background-color: aliceblue; padding: 10px;">

- Geographical location does not explain differences in the abundance or composition of the sourdoughs

**Effect of pets**

Replacing NaNs with 0, because adonis cannot handle NaNs and we are already working only with the people who filled out the survey

In [96]:
df = pd.read_csv(f"{Data_raw}/merged_output_usable.tsv", sep="\t")

# Pet columns used in the adonis formula
columns_to_check = ["guinea_pig", "cat", "dog", "turtle", "fish"]

# A missing value in a pet column is treated as 0
df[columns_to_check] = df[columns_to_check].fillna(0.0)
df.to_csv(f"{Data_diversity}/filtered-metadata/meta_pets.tsv", sep="\t", index=False)

Bray-Curtis (no need to filter the distance matrix again; the survey-filtered matrix from before is reused)

In [106]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_distance_matrix_survey.qza \
    --m-metadata-file $Data_diversity/filtered-metadata/meta_pets.tsv  \
    --p-formula "guinea_pig*cat*dog*turtle*fish" \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/bray_curtis_pets.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/bray_curtis_pets.qzv

In [108]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/bray_curtis_pets.qzv")

Jaccard (no need to filter the distance matrix again; the survey-filtered matrix from before is reused)

In [107]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix_survey.qza \
    --m-metadata-file $Data_diversity/filtered-metadata/meta_pets.tsv  \
    --p-formula "guinea_pig*cat*dog*turtle*fish" \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_pets.qzv

Saved Visualization to: Data/diversity/kmerizer-results-merged-sourdough-only/jaccard_pets.qzv

In [109]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/jaccard_pets.qzv")

<div style="background-color: aliceblue; padding: 10px;">

Pets do not have an influence on fungal composition or abundance
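
The adonis formulas follow R conventions, where `a*b` stands for the main effects plus their interaction (`a + b + a:b`). The sketch below expands such a formula to show how many terms a model contains. The helper `expand_star_formula` is hypothetical and only handles formulas made of variables joined by `*`.

```python
from itertools import combinations


def expand_star_formula(formula):
    """Expand an R-style 'a*b*c' formula into its main effects and all
    interaction terms (a, b, c, a:b, a:c, b:c, a:b:c)."""
    variables = [v.strip() for v in formula.split("*")]
    terms = []
    for size in range(1, len(variables) + 1):
        for combo in combinations(variables, size):
            terms.append(":".join(combo))
    return terms


location_terms = expand_star_formula("latitude*longitude")
pet_terms = expand_star_formula("guinea_pig*cat*dog*turtle*fish")
print(location_terms)
print(len(pet_terms))
```

With five pet variables the full factorial formula already contains 31 terms, which is worth keeping in mind when the number of survey respondents is small.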

<div style="background-color: skyblue; padding: 10px;">

## Diversity bioenv trial to identify the variables most associated with beta diversity

In [None]:
! qiime diversity bioenv \
    --i-distance-matrix $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/merged_output.tsv \
    --o-visualization $Data_diversity/kmerizer-results-merged-sourdough-only/jaccard-diversity-bioenv.qzv



In [None]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged-sourdough-only/jaccard-diversity-bioenv.qzv")

<div style="background-color: aliceblue; padding: 10px;">

Failed because of the many non-numerical and unassigned values
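
One possible preparation for another attempt would be to list first which metadata columns are numeric and complete for every sample. The sketch below does this on made-up rows in plain Python; the helper names and the example values are hypothetical, and it does not reproduce what bioenv does internally.

```python
def is_number(value):
    """True if value parses as a float and is not missing or NaN."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return number == number  # NaN is the only float not equal to itself


def numeric_complete_columns(rows):
    """Return the names of columns whose value is numeric in every row.

    rows: list of dicts, one per sample, all with the same keys.
    """
    if not rows:
        return []
    return [col for col in rows[0] if all(is_number(row[col]) for row in rows)]


# Made-up metadata rows: 'plate' is categorical, 'latitude' has a gap
rows = [
    {"pH": "3.8", "plate": "A", "latitude": "47.4"},
    {"pH": "4.1", "plate": "B", "latitude": ""},
]
usable = numeric_complete_columns(rows)
print(usable)
```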