Code to analyze beta diversity

In [2]:
# importing all required packages & notebook extensions at the start of the notebook
import os
import pandas as pd
import qiime2 as q2
from skbio import OrdinationResults
from qiime2 import Visualization
from seaborn import scatterplot

%matplotlib inline

In [3]:
# paths to the data directories used throughout the notebook
Data_raw='Data/raw'
Data_classified='Data/classified'
Data_diversity='Data/diversity'

## Creating the necessary files  
The core diversity metrics were computed directly on JupyterHub.
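The core-metrics calls in this notebook pass `--p-sampling-depth 3000`, which rarefies every sample to 3000 reads and drops samples with fewer reads. The following is a minimal standard-library sketch of that idea, not the QIIME 2 implementation; the function name `rarefy` and the feature names `ASV_1` to `ASV_3` are hypothetical.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a {feature: count} dict to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, mirroring how
    such samples are dropped from a rarefied table.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand the counts into one entry per read, then draw `depth` reads.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rarefied = {}
    for feature in rng.sample(reads, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

# Hypothetical samples for illustration only.
deep_sample = {"ASV_1": 4000, "ASV_2": 900, "ASV_3": 100}   # 5000 reads
shallow_sample = {"ASV_1": 1500, "ASV_2": 500}               # 2000 reads

print(rarefy(deep_sample, 3000))
print(rarefy(shallow_sample, 3000))   # None: below the sampling depth
```

A lower depth keeps more samples but discards more reads per sample, so the value is a trade-off that depends on the read-count distribution of the table.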

In [9]:
! qiime diversity core-metrics \
  --i-table $Data_classified/table-filtered.qza \
  --m-metadata-file $Data_raw/merged_output.tsv \
  --p-sampling-depth 3000 \
  --output-dir $Data_diversity/core-metrics-results-merged

Saved FeatureTable[Frequency] to: Data/diversity/core-metrics-results-merged/rarefied_table.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged/observed_features_vector.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged/shannon_vector.qza
Saved SampleData[AlphaDiversity] to: Data/diversity/core-metrics-results-merged/evenness_vector.qza
Saved DistanceMatrix to: Data/diversity/core-metrics-results-merged/jaccard_distance_matrix.qza
Saved DistanceMatrix to: Data/diversity/core-metrics-results-merged/bray_curtis_distance_matrix.qza
Saved PCoAResults to: Data/diversity/core-metrics-results-merged/jaccard_pcoa_results.qza
Saved PCoAResults to: Data/diversity/core-metrics-results-merged/bray_curtis_pcoa_results.qza
Saved Visualization to: Data/diversity/core-metrics-results-merged/jaccard_emperor.qzv
Saved Visualization 

The following script was then submitted as a job on Euler, because JupyterHub did not provide enough memory.

In [None]:
#!/bin/bash
#SBATCH --job-name=kmerizer
#SBATCH --time=04:00:00
#SBATCH --mem-per-cpu=32GB
#SBATCH --cpus-per-task=4
#SBATCH --output=kmerizer.log
source /cluster/home/nschwager/miniconda3/etc/profile.d/conda.sh
source ~/.bashrc
conda activate qiime2-amplicon-2025.10
qiime kmerizer core-metrics \
  --i-table /cluster/home/nschwager/kmerizer/table-filtered.qza \
  --i-sequences /cluster/home/nschwager/kmerizer/rep-seqs-filtered.qza \
  --m-metadata-file /cluster/home/nschwager/kmerizer/merged_output.tsv \
  --p-sampling-depth 3000 \
  --output-dir /cluster/scratch/home/kmerizer/kmerizer-results-merged

## Analysis of the ITS metadata

In [4]:
Visualization.load(f"{Data_diversity}/kmerizer-results-merged/scatterplot.qzv")

- Hand swab and sourdough communities contain different sets of fungi at different relative abundances
- There appears to be no difference between the right and left hand
- There appears to be some clustering by plate (P1-P4 vs. P5-P7) and by DNA extraction plate (DNA55-DNA58 vs. DNA59-DNA61), and above all by project (highschool vs. highschool_hs)

In [5]:
Visualization.load(f"{Data_diversity}/core-metrics-results-merged/bray_curtis_emperor.qzv")

**Comparison of project**  
- The projects highschool and highschool_hs differ significantly in composition (Bray-Curtis: p- and q-value of 0.001, pseudo-F of 302.679368)  
- The Jaccard metric gives similar results (p- and q-value of 0.001, pseudo-F of 136.490833), which indicates that a high proportion of features is not shared between the two projects
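The two metrics answer slightly different questions: Bray-Curtis compares abundances, while Jaccard only looks at which features are present. A minimal sketch of both definitions on two hypothetical count vectors (this is not the code QIIME 2 runs internally, and it assumes that each pair of samples has at least one non-zero count):

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum of |a - b| divided by sum of (a + b)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator

def jaccard(u, v):
    """Jaccard distance on presence/absence: 1 - shared / union."""
    present_u = {i for i, a in enumerate(u) if a > 0}
    present_v = {i for i, b in enumerate(v) if b > 0}
    return 1 - len(present_u & present_v) / len(present_u | present_v)

# Hypothetical counts for four features in two samples.
sample_a = [10, 5, 0, 0]
sample_b = [0, 5, 5, 10]

print(bray_curtis(sample_a, sample_b))
print(jaccard(sample_a, sample_b))
```

Here only one of four observed features is shared, so the Jaccard distance is high regardless of how abundant the shared feature is.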

Bray-Curtis

In [7]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column project \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/bray_curtis-project-significance.qzv

Saved Visualization to: Data/diversity/kmerizer-results/bray_curtis-project-significance.qzv

In [8]:
Visualization.load(f"{Data_diversity}/kmerizer-results/bray_curtis-project-significance.qzv")

Jaccard

In [9]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column project \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/jaccard-project-significance.qzv

Saved Visualization to: Data/diversity/kmerizer-results/jaccard-project-significance.qzv

In [10]:
Visualization.load(f"{Data_diversity}/kmerizer-results/jaccard-project-significance.qzv")

**Comparison of sample_type**  
- The pairwise PERMANOVA results for the Bray-Curtis metric show a significant difference between hand swabs and sourdough (p-value of 0.001, q-value of 0.002, pseudo-F of 309.080532)  
- The pairwise PERMANOVA results for the Jaccard metric point in the same direction: the sourdough vs. hand swab comparison has a p-value of 0.001, a q-value of 0.001429 and a pseudo-F of 145.174588
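The q-values are p-values adjusted for the number of pairwise comparisons, which is why they can exceed the raw p-value. The sketch below assumes the Benjamini-Hochberg procedure as the correction and uses hypothetical p-values; it is meant to show the mechanics, not to reproduce the numbers above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each q-value is capped by
    # the q-values of all larger p-values (keeps q monotone in p).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical p-values from four pairwise comparisons.
pairwise_p = [0.001, 0.001, 0.04, 0.5]
print(benjamini_hochberg(pairwise_p))
```

Tied p-values end up with the same q-value, so several comparisons at the permutation floor share one adjusted value.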

Bray-Curtis

In [11]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column sample_type \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/bray_curtis-sample_type-significance.qzv

Saved Visualization to: Data/diversity/kmerizer-results/bray_curtis-sample_type-significance.qzv

In [12]:
Visualization.load(f"{Data_diversity}/kmerizer-results/bray_curtis-sample_type-significance.qzv")

Jaccard

In [13]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column sample_type \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/jaccard-sample_type-significance.qzv

Saved Visualization to: Data/diversity/kmerizer-results/jaccard-sample_type-significance.qzv

In [14]:
Visualization.load(f"{Data_diversity}/kmerizer-results/jaccard-sample_type-significance.qzv")

**Comparison of hand**  
- As already suggested by the Emperor visualization, there is no statistically significant difference between the fungal composition of the right and left hand  
- Bray-Curtis: p-value: 0.766, q-value: 0.766, pseudo-F: 0.746564  
- Jaccard: p- and q-value: 0.297, pseudo-F: 1.024582
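For readers unfamiliar with the reported statistics, the following standard-library sketch shows how a PERMANOVA pseudo-F and its permutation p-value can be computed from a distance matrix. It is an illustration under simplifying assumptions (hypothetical toy data, no handling of zero within-group variation), not the implementation behind `beta-group-significance`. With the default of 999 permutations the smallest attainable p-value is 1/1000 = 0.001.

```python
import random

def pseudo_f(dist, groups):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity computed per group.
    ss_within = 0.0
    for label in labels:
        members = [i for i in range(n) if groups[i] == label]
        pairs = sum(dist[i][j] ** 2
                    for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += pairs / len(members)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, groups, permutations=999, seed=0):
    """Return (pseudo-F, p-value) using random label permutations."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical toy data: two tight groups that are far from each other.
toy_dist = [[0, 1, 3, 3],
            [1, 0, 3, 3],
            [3, 3, 0, 1],
            [3, 3, 1, 0]]
toy_groups = ["left", "left", "right", "right"]
print(permanova(toy_dist, toy_groups))
```

A pseudo-F near 1, as reported for the hand comparison, means that between-group variation is about as large as within-group variation.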

Bray-Curtis

In [15]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column hand \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/bray_curtis-hand-significance.qzv

Visualization.load(f"{Data_diversity}/kmerizer-results/bray_curtis-hand-significance.qzv")

Saved Visualization to: Data/diversity/kmerizer-results/bray_curtis-hand-significance.qzv

Jaccard

In [16]:
! qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_raw/20250913_metadata_ITS.tsv \
    --m-metadata-column hand \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results/jaccard-hand-significance.qzv

Visualization.load(f"{Data_diversity}/kmerizer-results/jaccard-hand-significance.qzv")

Saved Visualization to: Data/diversity/kmerizer-results/jaccard-hand-significance.qzv

## Analysis of the personal, environmental & sensory metadata

In [5]:
Visualization.load(f"{Data_diversity}/kmerizer-results-p-e-s/scatterplot.qzv")

In [17]:
Visualization.load(f"{Data_diversity}/core-metrics-results-p-e-s/bray_curtis_emperor.qzv")

**Comparison of background**  
- Bray-Curtis shows no significant difference between sterile and non-sterile background; Jaccard falls just below the 0.05 threshold, with a small pseudo-F  
- Bray-Curtis: p- and q-value: 0.072, pseudo-F: 1.525025  
- Jaccard: p- and q-value: 0.047, pseudo-F: 1.218399

Bray-Curtis

In [25]:
# Filter distance matrix to only include samples in metadata p-e-s
!qiime diversity filter-distance-matrix \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_diversity/merged_output.tsv \
    --o-filtered-distance-matrix $Data_diversity/kmerizer-results-p-e-s/bray_curtis_filtered.qza

!qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/bray_curtis_filtered.qza \
    --m-metadata-file $Data_diversity/merged_output.tsv \
    --m-metadata-column background \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results-p-e-s/bray_curtis-background-significance.qzv

Saved DistanceMatrix to: Data/diversity/kmerizer-results-p-e-s/bray_curtis_filtered.qza
Saved Visualization to: Data/diversity/kmerizer-results-p-e-s/bray_curtis-background-significance.qzv

In [26]:
Visualization.load(f"{Data_diversity}/kmerizer-results-p-e-s/bray_curtis-background-significance.qzv")

Jaccard

In [23]:
# Filter distance matrix to only include samples in metadata p-e-s
!qiime diversity filter-distance-matrix \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_diversity/merged_output.tsv \
    --o-filtered-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_filtered.qza

!qiime diversity beta-group-significance \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_filtered.qza \
    --m-metadata-file $Data_diversity/merged_output.tsv \
    --m-metadata-column background \
    --p-pairwise \
    --o-visualization $Data_diversity/kmerizer-results-p-e-s/jaccard-background-significance.qzv

Saved DistanceMatrix to: Data/diversity/kmerizer-results-p-e-s/jaccard_filtered.qza
Saved Visualization to: Data/diversity/kmerizer-results-p-e-s/jaccard-background-significance.qzv

In [24]:
Visualization.load(f"{Data_diversity}/kmerizer-results-p-e-s/jaccard-background-significance.qzv")

## Multivariate PERMANOVA test

Before the test could be run, samples with missing metadata values had to be removed from both the metadata and the distance matrix.

In [56]:
meta = pd.read_csv("Data/diversity/merged_output.tsv", sep="\t")

# remove rows without ethical agreement (1-based row numbers, converted to 0-based index labels below)
rows_to_drop = [6, 15, 18, 29, 31, 35, 36, 37, 42, 44, 45, 47, 50, 51, 52]

meta_clean = meta.drop(index=[i-1 for i in rows_to_drop])

meta_clean.to_csv("Data/diversity/merged_output_ethical_agreement.tsv", sep="\t", index=False)
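Hard-coded row numbers break silently when the metadata file changes. As an alternative, rows can be selected by their content. The sketch below uses only the standard library and assumes, hypothetically, that a missing agreement or measurement is stored as an empty cell; the helper name `drop_rows_missing` and the toy values are illustrative.

```python
import csv
import io

def drop_rows_missing(tsv_text, required_columns):
    """Keep only rows with a non-empty value in every required column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [row for row in reader
            if all((row[col] or "").strip() for col in required_columns)]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()

# Hypothetical three-sample metadata table.
toy_tsv = (
    "sample_ID\tethical_agreement\tday7_pH\n"
    "S1\tyes\t3.9\n"
    "S2\t\t4.1\n"
    "S3\tyes\t\n"
)
print(drop_rows_missing(toy_tsv, ["ethical_agreement"]))
```

The same helper covers later filtering steps by passing other column names, for example `["day7_pH"]`.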

Replacing spaces, slashes and hyphens in the column names with underscores to prevent errors in R

In [25]:
!awk 'NR==1{n=split($0,a,"\t");for(i=2;i<=n;i++)gsub(/[ \/-]/,"_",a[i]);OFS="\t";$0=a[1];for(i=2;i<=n;i++)$0=$0 OFS a[i];print;next}1' "Data/diversity/merged_output_ethical_agreement.tsv" > "Data/diversity/merged_output_ethical_agreement_clean.tsv"
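The awk one-liner is compact but hard to read. A Python sketch of the same header clean-up is given here for reference: it leaves the first (ID) column untouched and replaces spaces, slashes and hyphens with underscores in all other column names. The function name and the example header are hypothetical.

```python
import re

def sanitize_header(header_line):
    """Replace spaces, slashes and hyphens with "_" in all but the first column name."""
    names = header_line.rstrip("\n").split("\t")
    cleaned = [names[0]] + [re.sub(r"[ /-]", "_", name) for name in names[1:]]
    return "\t".join(cleaned)

# Hypothetical header line for illustration.
print(sanitize_header("sample-id\tsd stor-loc\tpH/home"))
```

Only the header line needs this treatment; data rows are passed through unchanged, as in the awk version.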


Removing columns that contain only zeros to prevent errors in adonis

In [31]:
meta_ethical_clean = pd.read_csv("Data/diversity/merged_output_ethical_agreement_clean.tsv", sep="\t", comment=None)

# keep only the columns that contain at least one non-zero value
meta_ethical_clean_nozeros = meta_ethical_clean.loc[:, (meta_ethical_clean != 0).any(axis=0)]

meta_ethical_clean_nozeros.to_csv("Data/diversity/merged_output_ethical_agreement_nozeros.tsv", sep="\t", index=False)

Listing the column names that are available for the adonis formula

In [20]:
print(list(meta_ethical_clean_nozeros.columns))


['sample_ID', 'person_id', 'start_time', 'completion_time', 'background', 'ethical_agreement', 'sd_bake_experience', 'sd_bake_last_time', 'yeast_bake_experience', 'yeast_bake_last_time', 'sd_stor_loc', 'sd_stor_temp', 'no_pets', 'guinea_pig', 'cat', 'dog', 'turtle', 'fish', 'pets', 'plants', 'plants_in_sd_room', 'siblings', 'age_siblings', 'other_fermentations', 'hands_disinfect', 'hands_wahsh_water', 'hands_wash_soap', 'hands_cream', 'handedness', 'biological_sex', 'age', 'hands_injuries', 'hands_injuries_treatments', 'diversity_assumption', 'latitude', 'longitude', 'house_type', 'garden', 'dist_agricultural_field', 'dist_farm', 'day7_pH', 'day7_TTA', 'day7_LAB', 'day7_yeast', 'day14_pH', 'day14_TTA', 'day14_LAB', 'day14_yeast', 'day21_pH', 'day21_TTA', 'day21_LAB', 'day21_yeast', 'day7_pH_home', 'day7_leavening', 'day7_aromas', 'day7_motivation', 'day7_observations', 'day14_pH_home', 'day14_leavening', 'day14_aromas', 'day14_motivation', 'day14_observations', 'day21_pH_home', 'day21_
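The first column in this list is `sample_ID`. QIIME 2 only accepts a fixed set of names for the ID column (for example `sample-id`, `sampleid` or `id`), so a file with this header is rejected when it is loaded as metadata. A minimal sketch of a header fix, with a hypothetical helper name and `sample-id` chosen as the replacement:

```python
# ID column names accepted by QIIME 2 metadata files.
CASE_INSENSITIVE_IDS = {"id", "sampleid", "sample id", "sample-id",
                        "featureid", "feature id", "feature-id"}
CASE_SENSITIVE_IDS = {"#SampleID", "#Sample ID", "#OTUID", "#OTU ID", "sample_name"}

def fix_id_header(header_line, new_name="sample-id"):
    """Rename the first column if QIIME 2 would not recognise it as an ID column."""
    names = header_line.rstrip("\n").split("\t")
    first = names[0]
    if first.lower() not in CASE_INSENSITIVE_IDS and first not in CASE_SENSITIVE_IDS:
        names[0] = new_name
    return "\t".join(names)

print(fix_id_header("sample_ID\tperson_id\tbackground"))
```

Applying this to the first line of the TSV before calling `qiime diversity adonis` should be enough, provided the IDs themselves match those in the distance matrix.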

Aroma analysis up to day 21 (day 28 is excluded because of missing values)

In [27]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_ethical_agreement.qza \
    --m-metadata-file $Data_diversity/merged_output_ethical_agreement_nozeros.tsv \
    --p-formula "ANIMAL_FEED_D7+ANIMAL_STABLE_D7+APPLE_D7+BREAD_D7+BUTTER_MILK_D7+BUTYRIC_ACID_D7+CAPERS_D7+CARAMEL_D7+CORN_D7+FARM_D7+GLUE_D7+GRAIN_FIELD_D7+HAY_D7+LACTIC_ACID_D7+MATURED_HARD_CHEESE_D7+MOIST_WOOD_D7+PAINT_D7+PEANUT_D7+PORRIDGE_D7+RICE_D7+SMOKED_D7+SOIL_D7+SOUR_CREAM_D7+SWEATY_FEET_D7+SYNTHETIC_D7+TOASTED_BREAD_D7+UNRIPE_FRUITS_D7+VEGETAL_D7+VINEGAR_D7+WHOLE_GRAIN_D7+YEASTY_D7+YOGHURT_D7+ALCOHOLIC_D14+ANIMAL_FEED_D14+APPLE_D14+BEER_D14+BREAD_D14+BUTTER_MILK_D14+BUTYRIC_ACID_D14+CARAMEL_D14+FARM_D14+GRAIN_FIELD_D14+HAZELNUT_D14+LACTIC_ACID_D14+MATURED_HARD_CHEESE_D14+MOIST_WOOD_D14+MOLDY_D14+PICKLED_VEGETABLES_D14+PINEAPPLE_D14+PORRIDGE_D14+RICE_D14+ROOT_VEGETABLES_D14+SMOKED_D14+SOUR_CREAM_D14+SWEATY_FEET_D14+TOASTED_BREAD_D14+UNRIPE_FRUITS_D14+VEGETAL_D14+VINEGAR_D14+WHOLE_GRAIN_D14+YEASTY_D14+YOGHURT_D14+ANIMAL_FEED_D21+BANANA_D21+BEER_D21+BREAD_D21+BUTTER_MILK_D21+BUTYRIC_ACID_D21+CAPERS_D21+CHICKPEA_D21+CORN_D21+FARM_D21+GRAIN_FIELD_D21+HAY_D21+LACTIC_ACID_D21+MATURED_HARD_CHEESE_D21+MOIST_WOOD_D21+MOLDY_D21+PAINT_D21+PICKLED_VEGETABLES_D21+PORRIDGE_D21+SMOKED_D21+SOUR_CREAM_D21+SWEATY_FEET_D21+SYNTHETIC_D21+UNRIPE_FRUITS_D21+VEGETAL_D21+VINEGAR_D21+WHOLE_GRAIN_D21+YEASTY_D21+YOGHURT_D21" \
    --o-visualization $Data_diversity/kmerizer-results-p-e-s/jaccard_multi_aroma.qzv

There was an issue with loading the file Data/diversity/merged_output_ethical_agreement_nozeros.tsv as metadata:

  Found unrecognized ID column name 'sample_ID' while searching for header. The first column name in the header defines the ID column, and must be one of these values:

  Case-insensitive: 'feature id', 'feature-id', 'featureid', 'id', 'sample id', 'sample-id', 'sampleid'

  Case-sensitive: '#OTU ID', '#OTUID', '#Sample ID', '#SampleID', 'sample_name'

  NOTE: Metadata files must contain tab-separated values.

  There may be more errors present in the metadata file. To get a full report, sample/feature metadata files can be validated with Keemei: https://keemei.qiime2.org

  Find details on QIIME 2 metadata requirements here: https://docs.qiime2.org/2025.7/tutorials/metadata/

In [1]:
! qiime diversity adonis --help

Usage: qiime diversity adonis [OPTIONS]

  Determine whether groups of samples are significantly different from one
  another using the ADONIS permutation-based statistical test in vegan-R. The
  function partitions sums of squares of a multivariate data set, and is
  directly analogous to MANOVA (multivariate analysis of variance). This
  action differs from beta_group_significance in that it accepts R formulae to
  perform multi-way ADONIS tests; beta_group_signficance only performs one-way
  tests. For more details, consult the reference manual available on the CRAN
  vegan page: https://CRAN.R-project.org/package=vegan

Inputs:
  --i-distance-matrix ARTIFACT
    DistanceMatrix       Matrix of distances between pairs of samples.
                                                                    [required]
Parameters:
  --m-metadata-file METADATA...
    (multiple            Sample metadata containing formula terms.

In [13]:
Visualization.load(f"{Data_diversity}/kmerizer-results-p-e-s/jaccard_multi_aroma.qzv")

**latitude & longitude**  
- For the Jaccard metric, microbial diversity differs significantly along longitude (p=0.001)  
- This is not confirmed by Bray-Curtis (p=0.091)

Jaccard

In [50]:
!qiime diversity filter-distance-matrix \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_diversity/merged_output_ethical_agreement.tsv \
    --o-filtered-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_ethical_agreement.qza

Saved DistanceMatrix to: Data/diversity/kmerizer-results-p-e-s/jaccard_ethical_agreement.qza

In [51]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_ethical_agreement.qza \
    --m-metadata-file $Data_diversity/merged_output_ethical_agreement.tsv \
    --p-formula "latitude*longitude" \
    --o-visualization $Data_diversity/kmerizer-results-p-e-s/jaccard_multi_place.qzv

Saved Visualization to: Data/diversity/kmerizer-results-p-e-s/jaccard_multi_place.qzv

In [52]:
Visualization.load(f"{Data_diversity}/kmerizer-results-p-e-s/jaccard_multi_place.qzv")

Bray-Curtis

In [53]:
!qiime diversity filter-distance-matrix \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/bray_curtis_distance_matrix.qza \
    --m-metadata-file $Data_diversity/merged_output_ethical_agreement.tsv \
    --o-filtered-distance-matrix $Data_diversity/kmerizer-results-p-e-s/bray_curtis_ethical_agreement.qza

Saved DistanceMatrix to: Data/diversity/kmerizer-results-p-e-s/bray_curtis_ethical_agreement.qza

In [54]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/bray_curtis_ethical_agreement.qza \
    --m-metadata-file $Data_diversity/merged_output_ethical_agreement.tsv \
    --p-formula "latitude*longitude" \
    --o-visualization $Data_diversity/kmerizer-results-p-e-s/bray_curtis_multi_place.qzv

Saved Visualization to: Data/diversity/kmerizer-results-p-e-s/bray_curtis_multi_place.qzv

In [55]:
Visualization.load(f"{Data_diversity}/kmerizer-results-p-e-s/bray_curtis_multi_place.qzv")

**plants, pH & TTA**  
- None of these variables shows a significant effect on diversity, either alone or in interaction (all p-values are 1)
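In an R formula, `a*b` stands for the main effects plus their interaction (`a + b + a:b`), so chaining seven variables with `*` produces every interaction up to the seven-way term. The sketch below enumerates these terms to make the model size visible; the function name is hypothetical and only pure `*` formulas are handled. If the number of terms reaches or exceeds the number of samples, no residual degrees of freedom remain, which is one possible reason for p-values of exactly 1. That is an assumption worth checking against the adonis output, not a conclusion drawn from the notebook.

```python
from itertools import combinations

def expand_star_formula(formula):
    """Expand an R-style "a*b*c" formula into main effects and all interactions."""
    variables = [v.strip() for v in formula.split("*")]
    terms = []
    for order in range(1, len(variables) + 1):
        for combo in combinations(variables, order):
            terms.append(":".join(combo))
    return terms

print(expand_star_formula("latitude*longitude"))
# ['latitude', 'longitude', 'latitude:longitude']

seven_way = "plants*day7_pH*day14_pH*day21_pH*day7_TTA*day14_TTA*day21_TTA"
print(len(expand_star_formula(seven_way)))
# 127
```

Using `+` instead of `*` restricts the model to main effects, which keeps the number of terms equal to the number of variables.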

In [45]:
meta = pd.read_csv("Data/diversity/merged_output_ethical_agreement.tsv", sep="\t")
# remove rows without pH values (1-based row numbers)
rows_to_drop = [4, 9, 13, 24, 28, 29, 33, 36]

meta_clean = meta.drop(index=[i-1 for i in rows_to_drop])

meta_clean.to_csv("Data/diversity/merged_output_plants_pH.tsv", sep="\t", index=False)

In [46]:
!qiime diversity filter-distance-matrix \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_distance_matrix.qza \
    --m-metadata-file $Data_diversity/merged_output_plants_pH.tsv \
    --o-filtered-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_filtered_plants_pH.qza

Saved DistanceMatrix to: Data/diversity/kmerizer-results-p-e-s/jaccard_filtered_plants_pH.qza

In [47]:
! qiime diversity adonis \
    --i-distance-matrix $Data_diversity/kmerizer-results-p-e-s/jaccard_filtered_plants_pH.qza \
    --m-metadata-file $Data_diversity/merged_output_plants_pH.tsv \
    --p-formula "plants*day7_pH*day14_pH*day21_pH*day7_TTA*day14_TTA*day21_TTA" \
    --o-visualization $Data_diversity/kmerizer-results-p-e-s/jaccard_multi_plants.qzv

Saved Visualization to: Data/diversity/kmerizer-results-p-e-s/jaccard_multi_plants.qzv

In [48]:
Visualization.load(f"{Data_diversity}/kmerizer-results-p-e-s/jaccard_multi_plants.qzv")