# Metadata Overview

#### Description of data and aims:
In the summer of 2024, a mysterious disease dubbed the “pundemic” by the media began cropping up worldwide. Diseased patients make puns at every opportunity. A link between the pundemic and changes in the gut microbiome was discovered, and a doctor at the USZ set up a clinical trial using fecal microbiota transplants (FMT) as a possible treatment.

Trial data:  
Fecal microbiome samples were collected from pundemic patients before and after the trial, in both the treatment and placebo groups. Pundemic severity was quantified as puns per hour. Fecal samples were also collected from the FMT donors.

Because both the bacterial and the fungal gut microbiome are of interest, the USZ team collected both **16S rRNA gene** and **ITS** data from the study cohort.

Aims:
1. Analyzing the ITS data in order to further explore the connection between pundemic symptoms and an altered gut mycobiome composition
2. Analyzing the potential of FMT as a pundemic treatment option.

You have received DNA sequences as well as metadata allowing you to distinguish pundemic from healthy samples.


## Metadata import and exploration

To get a better sense of the data we are working with, we do some preliminary exploration of the metadata here.

In [4]:
# Import libraries
import os
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.backends.backend_pdf import PdfPages
import pandas as pd

In [5]:
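# Import the metadata into a DataFrame (download the file first if it is missing)
os.makedirs('data', exist_ok=True)  # wget -O fails if the target folder does not exist
if not os.path.exists('data/pundemic_metadata.tsv'):
    !wget -O data/pundemic_metadata.tsv https://polybox.ethz.ch/index.php/s/7LxWSbaw2q37yof/download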

meta_df = pd.read_csv('data/pundemic_metadata.tsv', sep='\t', index_col=0)
meta_df['age'] = pd.to_numeric(meta_df['age'], errors='coerce')

# Sort the DataFrame by 'age'
meta_age_df = meta_df.sort_values(by='age')
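
The import cell loads the table but prints no overview. A minimal sketch of a per-column summary follows. The helper name `summarize_metadata` and the toy table are hypothetical and only stand in for `meta_df`; the `age` coercion mirrors the import cell.

```python
import pandas as pd


def summarize_metadata(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing values, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Toy stand-in for meta_df (hypothetical values, not trial data)
toy_df = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2", "P3"],
        "sex": ["male", "male", "female", "female"],
        "age": ["34", "34", "unknown", "51"],
    },
    index=["S1", "S2", "S3", "S4"],
)
# Same coercion as in the import cell: non-numeric ages become NaN
toy_df["age"] = pd.to_numeric(toy_df["age"], errors="coerce")

overview = summarize_metadata(toy_df)
print(overview)
```

Calling `summarize_metadata(meta_df)` on the real table would show how many `age` entries were turned into missing values by `errors='coerce'`.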

In [25]:
# FMT patients split by sex, plus the rows whose patient_id occurs more than once
male_fmt_patients = meta_df[(meta_df.sex == "male") & (meta_df.disease_subgroup == 'FMT')]
female_fmt_patients = meta_df[(meta_df.sex == "female") & (meta_df.disease_subgroup == 'FMT')]

duplicate_patients_m = male_fmt_patients[male_fmt_patients.duplicated(subset='patient_id', keep=False)]
duplicate_patients_f = female_fmt_patients[female_fmt_patients.duplicated(subset='patient_id', keep=False)]

print("male_fmt_patients", male_fmt_patients['patient_id'].nunique())
print("female_fmt_patients", female_fmt_patients['patient_id'].nunique())

male_fmt_patients 16
female_fmt_patients 13


In [6]:
# Placebo patients split by sex, plus the rows whose patient_id occurs more than once
male_placebo_patients = meta_df[(meta_df.sex == "male") & (meta_df.disease_subgroup == 'Placebo')]
female_placebo_patients = meta_df[(meta_df.sex == "female") & (meta_df.disease_subgroup == 'Placebo')]

duplicate_patients_m = male_placebo_patients[male_placebo_patients.duplicated(subset='patient_id', keep=False)]
duplicate_patients_f = female_placebo_patients[female_placebo_patients.duplicated(subset='patient_id', keep=False)]


print("male_placebo_patients", male_placebo_patients['patient_id'].nunique())
print("female_placebo_patients", female_placebo_patients['patient_id'].nunique())

male_placebo_patients 10
female_placebo_patients 9
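
The two cells above repeat the same filter for each subgroup and sex. As a sketch, a single `groupby` can count unique patients for every combination at once. The helper name `count_patients` and the toy table are hypothetical; the column names `patient_id`, `sex` and `disease_subgroup` are the ones used above.

```python
import pandas as pd


def count_patients(df: pd.DataFrame, by=("disease_subgroup", "sex")) -> pd.Series:
    """Count distinct patient_id values for every combination of the `by` columns."""
    return df.groupby(list(by))["patient_id"].nunique()


# Toy stand-in for meta_df (hypothetical values); P1 and P4 have two samples each
toy_df = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P3", "P4", "P4", "P5"],
    "sex": ["male", "male", "female", "male", "female", "female", "male"],
    "disease_subgroup": ["FMT", "FMT", "FMT", "Placebo", "Placebo", "Placebo", "FMT"],
})

counts = count_patients(toy_df)
print(counts)
```

Because `nunique()` counts patients rather than rows, patients sampled at several time points are counted only once, as in the cells above.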


In [3]:
# Open the PDF file to save all the plots (create the output folder if needed)
os.makedirs('results', exist_ok=True)
with PdfPages('results/pundemic_plots_complete.pdf') as pdf:
    
    # Figure 1: Sex Ratio
    sex_counts = meta_df['sex'].value_counts()
    plt.figure(figsize=(6, 4))
    plt.bar(sex_counts.index, sex_counts.values, color=['skyblue', 'lightcoral', 'purple'])
    plt.xlabel('Sex')
    plt.ylabel('Count')
    #plt.title('Sex Ratio')
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 1: Distribution of participants by sex.", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 2: Overview of Disease Subgroups
    disease_counts = meta_df['disease_subgroup'].value_counts()
    plt.figure(figsize=(6, 4))
    plt.bar(disease_counts.index, disease_counts.values, color=['skyblue', 'lightcoral', 'purple'])
    plt.xlabel('Disease Subgroups')
    plt.ylabel('Count')
    #plt.title('Overview over Disease Subgroups')
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 2: Distribution of participants across disease subgroups.", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 3: Overview of Timepoints
    time_counts = meta_df['time_point'].value_counts()
    plt.figure(figsize=(6, 4))
    plt.bar(time_counts.index, time_counts.values, color=['skyblue', 'lightcoral', 'purple', 'darkblue'])
    plt.xlabel('Timepoints')
    plt.ylabel('Count')
    #plt.title('Overview over Timepoints')
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 3: Distribution of participants across different timepoints.", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 4: Ethnicity Distribution
    plt.figure(figsize=(6, 4))
    sns.countplot(data=meta_df, x='ethnicity', order=meta_df['ethnicity'].value_counts().index)
    #plt.title("Ethnicity Distribution")
    plt.xticks(rotation=45)
    plt.ylabel("Number of Participants")
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 4: Ethnicity distribution of participants.", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 5: Sex Distribution Across Groups
    plt.figure(figsize=(6, 4))
    sns.countplot(data=meta_df, x='sex', hue='group')
    #plt.title("Sex Distribution Across Groups")
    plt.ylabel("Number of Participants")
    plt.legend(title="Group")
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 5: Sex distribution across Puns and Healthy groups.", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 6: Puns per Hour Pre- and Post-Treatment
    plt.figure(figsize=(12, 6))
    sns.violinplot(data=meta_df, x='time_point', y='puns_per_hour_post_treatment', hue='group', split=True)
    #plt.title("Puns Per Hour Pre- and Post-Treatment")
    plt.ylabel("Puns per Hour (Post-Treatment)")
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 6: Comparison of puns per hour pre- and post-treatment across groups.", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 7: Blinded Clinical Response Analysis
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=meta_df, x='blinded_clinical_response', y='puns_per_hour_post_treatment')
    #plt.title("Post-Treatment Puns per Hour by Clinical Response")
    plt.ylabel("Puns per Hour (Post-Treatment)")
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 7: Post-treatment puns per hour by clinical response (NR or Res).", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 8: Correlation Between Puns per Hour Pre- and Post-Treatment
    plt.figure(figsize=(6, 4))
    sns.scatterplot(data=meta_df, x='puns_per_hour_pre_treatment', y='puns_per_hour_post_treatment', hue='group')
    sns.regplot(data=meta_df, x='puns_per_hour_pre_treatment', y='puns_per_hour_post_treatment', scatter=False, color='black')
    #plt.title("Correlation Between Puns Per Hour Pre- and Post-Treatment")
    plt.xlabel("Puns per Hour (Pre-Treatment)")
    plt.ylabel("Puns per Hour (Post-Treatment)")
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 8: Correlation between pre- and post-treatment puns per hour.", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 9: Puns per Hour Across Disease Subgroups
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=meta_df, x='disease_subgroup', y='puns_per_hour_post_treatment')
    #plt.title("Post-Treatment Puns per Hour by Disease Subgroup")
    plt.ylabel("Puns per Hour (Post-Treatment)")
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 9: Puns per hour post-treatment by disease subgroups (FMT, Placebo, Donor).", fontsize=10)
    pdf.savefig()
    plt.close()

    # Figure 10: Age vs. Puns per Hour Post-Treatment
    plt.figure(figsize=(11, 4))
    sns.scatterplot(data=meta_age_df, x='age', y='puns_per_hour_post_treatment', hue='group')
    #plt.title("Age vs. Puns Per Hour Post-Treatment")
    plt.xlabel("Age")
    plt.ylabel("Puns per Hour (Post-Treatment)")
    plt.tight_layout()
    #plt.text(-0.5, -10, "Figure 10: Scatter plot showing the relationship between age and puns per hour post-treatment.", fontsize=10)
    pdf.savefig()
    plt.close()
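
The plots compare pre- and post-treatment puns per hour visually. A numeric companion could be the per-patient change, summarised by disease subgroup. This is a minimal sketch, not part of the original analysis. It assumes that the pre- and post-treatment values are repeated on every sample row of a patient, so one row per `patient_id` is enough. The helper name `pun_change_by_subgroup` and the toy values are hypothetical.

```python
import pandas as pd


def pun_change_by_subgroup(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise the post-minus-pre change in puns per hour per disease subgroup."""
    # Assumption: pre/post values are identical across a patient's rows,
    # so keeping the first row per patient loses no information.
    per_patient = df.drop_duplicates(subset="patient_id").copy()
    per_patient["pun_change"] = (
        per_patient["puns_per_hour_post_treatment"]
        - per_patient["puns_per_hour_pre_treatment"]
    )
    return per_patient.groupby("disease_subgroup")["pun_change"].agg(
        ["count", "mean", "median"]
    )


# Toy stand-in for meta_df (hypothetical values, not trial data)
toy_df = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P3", "P4"],
    "disease_subgroup": ["FMT", "FMT", "FMT", "Placebo", "Placebo"],
    "puns_per_hour_pre_treatment": [10, 10, 8, 9, 7],
    "puns_per_hour_post_treatment": [4, 4, 6, 9, 8],
})

summary = pun_change_by_subgroup(toy_df)
print(summary)
```

A negative mean change indicates fewer puns per hour after treatment. Rows without pun counts, such as donor samples, would yield missing changes and be left out of the mean and median.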