# Host genome contamination in metagenome samples: does it match the annotated host?

- In the contamination analysis, **"human", "mouse", "chicken", and "pig" are the four host genomes that most often contaminate** the 3k metagenome samples.
![Screenshot%202024-12-04%20at%2011.29.46.png](attachment:Screenshot%202024-12-04%20at%2011.29.46.png)
- We want to know **how well that contamination matches the annotated host of each metagenome**.
- The analysis below therefore does the following:
    - **merge** "3k.hg.csv" with "list.csv" to add the "biome2" annotation
    - **select the metagenomes with p_metag > 5% and count them by "biome2" type**
    - plot the result interactively, so that individual metagenomes and host genomes can be inspected by hovering and zooming

In [42]:
from io import StringIO
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'

In [43]:
# Load the data
data = pd.read_csv('/Users/ZYZhao/projects/AL/onek/analysis/3k.hg.csv')
data

Unnamed: 0,query,p_genome,avg_abund,p_metag,metagenome name
0,bosTau9.fa.gz,0.00%,1.0,0.00%,ERR9456920
1,bosTau9.fa.gz,0.00%,1.2,0.00%,ERR3610816
2,bosTau9.fa.gz,0.00%,1.0,0.00%,SRR7664969
3,bosTau9.fa.gz,0.00%,20.2,0.10%,SRR12180980
4,bosTau9.fa.gz,0.00%,3.2,0.00%,SRR6713626
...,...,...,...,...,...
19440,susScr11.fa.gz,0.00%,7.9,0.10%,SRR5808817
19441,susScr11.fa.gz,0.00%,1.5,0.00%,SRR10810041
19442,susScr11.fa.gz,0.00%,2.6,0.00%,ERR2607371
19443,susScr11.fa.gz,0.00%,12.6,0.10%,SRR12180994


In [44]:
data_list = pd.read_csv('/Users/ZYZhao/projects/AL/onek/analysis/1_zyzhao-list-oct3.csv')

data_list_biome2 = data_list[['accession', 'biome2']]
data_list_biome2

Unnamed: 0,accession,biome2
0,SRR7299214,Engineered:Solid waste
1,SRR6490006,Host-associated:Microbial
2,SRR6490005,Host-associated:Microbial
3,SRR6490004,Host-associated:Microbial
4,SRR6490003,Host-associated:Microbial
...,...,...
3033,ERR3341539,Environmental:Aquatic
3034,ERR3341563,Environmental:Aquatic
3035,SRR6193124,Environmental:Aquatic
3036,SRR5268667,Environmental:Aquatic


In [45]:
# Left-merge on 'metagenome name' from data and 'accession' from data_list_biome2
merged_biome2 = pd.merge(data, data_list_biome2, left_on='metagenome name', right_on='accession', how='left')

# Display the merged data
print(merged_biome2.head())
merged_biome2.to_csv('merged_biome2.csv', index=False) 

           query p_genome  avg_abund p_metag metagenome name    accession  \
0  bosTau9.fa.gz    0.00%        1.0   0.00%      ERR9456920   ERR9456920   
1  bosTau9.fa.gz    0.00%        1.2   0.00%      ERR3610816   ERR3610816   
2  bosTau9.fa.gz    0.00%        1.0   0.00%      SRR7664969   SRR7664969   
3  bosTau9.fa.gz    0.00%       20.2   0.10%     SRR12180980  SRR12180980   
4  bosTau9.fa.gz    0.00%        3.2   0.00%      SRR6713626   SRR6713626   

                    biome2  
0    Environmental:Aquatic  
1  Host-associated:Mammals  
2  Host-associated:Mammals  
3  Host-associated:Insecta  
4   Engineered:Solid waste  
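A left merge silently leaves `biome2` empty for any metagenome whose accession is missing from the annotation list, and it duplicates rows if an accession appears there twice. The sketch below shows one way to check for both cases. The two small tables and their accession values are made-up placeholders, not rows from the real CSV files.

```python
import pandas as pd

# Toy stand-ins for the two tables (placeholder accessions, not real data).
hits = pd.DataFrame({
    "query": ["hg38.all", "mm39.fa.gz", "hg38.all"],
    "p_metag": ["12.50%", "0.10%", "6.00%"],
    "metagenome name": ["SRR0000001", "SRR0000001", "SRR0000002"],
})
annotations = pd.DataFrame({
    "accession": ["SRR0000001"],
    "biome2": ["Host-associated:Human"],
})

# validate="many_to_one" raises pandas.errors.MergeError if an accession
# occurs more than once in the annotation table.
# indicator=True adds a "_merge" column: "both" or "left_only".
checked = pd.merge(
    hits,
    annotations,
    left_on="metagenome name",
    right_on="accession",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Metagenomes that found no biome2 annotation.
unmatched = checked.loc[checked["_merge"] == "left_only", "metagenome name"].unique()
print(f"{len(unmatched)} metagenome(s) without a biome2 annotation:", list(unmatched))
```

If `unmatched` is non-empty on the real data, those rows are dropped by the later `value_counts()` step (which skips missing values by default) and show up uncolored or missing in the plot.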


In [46]:
biome2 = pd.read_csv('/Users/ZYZhao/projects/AL/onek/analysis/merged_biome2.csv')
biome2

Unnamed: 0,query,p_genome,avg_abund,p_metag,metagenome name,accession,biome2
0,bosTau9.fa.gz,0.00%,1.0,0.00%,ERR9456920,ERR9456920,Environmental:Aquatic
1,bosTau9.fa.gz,0.00%,1.2,0.00%,ERR3610816,ERR3610816,Host-associated:Mammals
2,bosTau9.fa.gz,0.00%,1.0,0.00%,SRR7664969,SRR7664969,Host-associated:Mammals
3,bosTau9.fa.gz,0.00%,20.2,0.10%,SRR12180980,SRR12180980,Host-associated:Insecta
4,bosTau9.fa.gz,0.00%,3.2,0.00%,SRR6713626,SRR6713626,Engineered:Solid waste
...,...,...,...,...,...,...,...
19440,susScr11.fa.gz,0.00%,7.9,0.10%,SRR5808817,SRR5808817,Environmental:Aquatic
19441,susScr11.fa.gz,0.00%,1.5,0.00%,SRR10810041,SRR10810041,Host-associated:Insecta
19442,susScr11.fa.gz,0.00%,2.6,0.00%,ERR2607371,ERR2607371,Engineered:Wastewater
19443,susScr11.fa.gz,0.00%,12.6,0.10%,SRR12180994,SRR12180994,Host-associated:Insecta


In [47]:
biome2 = pd.read_csv('/Users/ZYZhao/projects/AL/onek/analysis/merged_biome2.csv')
biome2['p_metag'] = biome2['p_metag'].str.rstrip('%').astype('float') / 100

# Filter the DataFrame to include only rows where 'p_metag' is greater than 5%
filtered_data = biome2[biome2['p_metag'] > 0.05]

# Group the filtered data by 'biome2' and count the occurrences
biome2_counts = filtered_data['biome2'].value_counts()

# Display the number of rows (metagenome x host genome hits) per 'biome2' type
print(biome2_counts)

Host-associated:Mammals    157
Host-associated:Human      126
Host-associated:Birds       16
Host-associated:Animal      11
Engineered:Wastewater        7
Host-associated:Fish         6
Engineered:Modeled           6
Environmental:Aquatic        1
Name: biome2, dtype: int64
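The counts above are per row, and each row is one metagenome paired with one host genome, so a metagenome that exceeds 5% for two host genomes is counted twice. To see which host genome is hitting which biome, a cross-tabulation is a natural next step. The following is a minimal sketch on a made-up table (sample names `S1` to `S4` are placeholders); it uses the same column names as `merged_biome2.csv`.

```python
import pandas as pd

# Toy table with the same columns as merged_biome2.csv (placeholder values).
merged = pd.DataFrame({
    "query": ["hg38.all", "hg38.all", "mm39.fa.gz", "mm39.fa.gz", "galGal6.fa.gz"],
    "p_metag": ["12.50%", "0.10%", "55.00%", "7.00%", "60.00%"],
    "metagenome name": ["S1", "S2", "S3", "S1", "S4"],
    "biome2": [
        "Host-associated:Human",
        "Environmental:Aquatic",
        "Host-associated:Mammals",
        "Host-associated:Human",
        "Host-associated:Birds",
    ],
})

# Same parsing and threshold as in the cell above.
merged["p_metag"] = merged["p_metag"].str.rstrip("%").astype(float) / 100
high = merged[merged["p_metag"] > 0.05]

# Rows: host genome, columns: biome2, cells: number of hits above 5%.
pair_counts = pd.crosstab(high["query"], high["biome2"])
print(pair_counts)

# Number of distinct metagenomes per biome2, ignoring which host genome hit.
unique_per_biome = high.groupby("biome2")["metagenome name"].nunique()
print(unique_per_biome)
```

In this toy table, `S1` exceeds 5% for both human and mouse, so the row count for "Host-associated:Human" is 2 while the number of distinct metagenomes is 1.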


In [82]:
# Interactive plot of host contamination, colored by Biome2 type
biome2 = pd.read_csv('/Users/ZYZhao/projects/AL/onek/analysis/merged_biome2.csv')
biome2['p_metag'] = biome2['p_metag'].str.rstrip('%').astype('float') / 100

# Ensure 'p_metag' is a float
biome2['p_metag'] = pd.to_numeric(biome2['p_metag'], errors='coerce')

# Filter the data for p_metag greater than 5%
filtered_high_p_metag = biome2[biome2["p_metag"] > 0.05]

# Create an interactive scatter plot
fig = px.scatter(
    filtered_high_p_metag,
    x="query",
    y="p_metag",
    
    color="biome2",  # Color code dots based on biome2 type
    title="Interactive Scatter Plot of Query vs P metag (p_metag > 5%)",
    labels={"query": "Query", "p_metag": "p_metag (fraction)"},
    hover_data=["metagenome name", "avg_abund", "p_metag"]
)

fig.show()

![newplot.png](attachment:newplot.png)

## Analysis

- The interactive scatter plot of query vs. p_metag (restricted to p_metag > 5%) shows the contamination level for each host genome, colored by Biome2 category.

### Key Observations:
- Human (hg38.all) and mouse (mm39.fa.gz) contamination:
    - Human and mouse genomes show the highest contamination levels, with p_metag exceeding 80% in multiple samples.
    - Both cluster predominantly in "Host-associated: Human" (orange) and "Host-associated: Mammals" (blue), as expected.
    - This is consistent with humans and mice being frequent hosts in clinical samples and mammalian experimental models, respectively.

- Cattle (bosTau9.fa.gz):
    - Cattle contamination is moderate to high in "Host-associated: Mammals" (blue).
    - A few points lie at the low end of the plotted range, suggesting sporadic detection.

- Chicken (galGal6.fa.gz) and other hosts:
    - Chicken contamination shows relatively lower p_metag values, with a few outliers reaching ~60%.
    - These points cluster in "Host-associated: Birds" (cyan), the expected source.
    - Contamination from pig (susScr11.fa.gz), sheep (oviAri4.fa.gz), and dog (canFam6.fa.gz) is generally low, with only a handful of samples exceeding 20%.

- Unexpected Contamination:
    - A few samples in "Engineered: Wastewater" (purple) and "Engineered: Modeled" (light green) show mouse contamination, which may indicate cross-contamination during sample handling or processing.
    - Human contamination in non-human biomes, such as "Host-associated: Fish" (green), shows why unexpected results need to be verified.
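
The unexpected cases above were read off the plot by eye. They could also be listed programmatically by comparing each hit against the biome in which that host genome is expected. The sketch below is one possible approach, not part of the original analysis: the `EXPECTED_BIOMES` mapping and the `flag_unexpected` helper are hypothetical, and the mapping would need to be filled in for every host genome in the real table.

```python
import pandas as pd

# ASSUMPTION: hypothetical mapping from host genome file to the biome2 labels
# where that host is expected. Not taken from the original analysis.
EXPECTED_BIOMES = {
    "hg38.all": {"Host-associated:Human"},
    "mm39.fa.gz": {"Host-associated:Mammals"},
    "galGal6.fa.gz": {"Host-associated:Birds"},
}


def flag_unexpected(df, expected=EXPECTED_BIOMES, threshold=0.05):
    """Return hits above `threshold` whose biome2 is not expected for the host.

    Host genomes missing from `expected` are always flagged.
    `df` needs numeric 'p_metag' plus 'query' and 'biome2' columns.
    """
    high = df[df["p_metag"] > threshold]
    is_expected = pd.Series(
        [biome in expected.get(query, set())
         for query, biome in zip(high["query"], high["biome2"])],
        index=high.index,
        dtype=bool,
    )
    return high[~is_expected]


# Toy example with placeholder sample names.
toy = pd.DataFrame({
    "query": ["hg38.all", "hg38.all", "mm39.fa.gz", "mm39.fa.gz"],
    "metagenome name": ["S1", "S2", "S3", "S4"],
    "p_metag": [0.50, 0.20, 0.10, 0.01],
    "biome2": [
        "Host-associated:Human",
        "Host-associated:Fish",
        "Engineered:Wastewater",
        "Engineered:Wastewater",
    ],
})

flagged = flag_unexpected(toy)
print(flagged[["query", "metagenome name", "p_metag", "biome2"]])
```

A flagged row is only a candidate for follow-up: it may reflect cross-contamination, a misannotated sample, or a mapping that is too strict (for example, mouse hits in a broad "Host-associated:Animal" category).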

### Conclusion:
- Human (hg38.all) and mouse (mm39.fa.gz) are the most prominent contaminants, with the highest p_metag values in the dataset.
- Most contamination falls in the expected Biome2 category (e.g., "Host-associated: Human" for human contamination), but the anomalies in engineered or environmental categories suggest possible cross-contamination or misannotation.
- Contamination checks of this kind should be part of quality control before downstream metagenomic analysis, so that such cases are identified and handled.