# Accelerating metagenomic analysis with [Graphistry](https://www.graphistry.com/), focusing on viral tracing over time

## [viral calling pipeline here](https://github.com/dcolinmorgan/viral_snake)

GPU-accelerated UMAP and DBSCAN, combined with Graphistry visualization, make it faster to cluster and compare the bacterial compositions of metagenomic samples, and much easier to explore the results.

*   Task: Analyze metagenomic samples for similarity
*   Data: time series samples
    *   563 samples collected from 84 donors, producing 4 dense long-term time series (up to 1 sample every other day during 18 months)
*   [data](https://www.ebi.ac.uk/ena/browser/view/PRJNA544527)
*   [metadata](https://static-content.springer.com/esm/art%3A10.1038%2Fs41591-019-0559-3/MediaObjects/41591_2019_559_MOESM3_ESM.xlsx)
*   [paper](https://sci-hub.se/10.1038/s41591-019-0559-3)


**Insight / Result:**

UMAP and DBSCAN take 43 s with GPU acceleration, versus 2342 s for the baseline run.
That is over **50X** faster for a single run. Since [the reference paper for this analysis](https://journals.asm.org/doi/full/10.1128/msystems.00118-23) runs this analysis 12x per dataset (here we have only 1 dataset), we could expect to cut nearly 8 hours of compute for this dataset to less than 10 minutes in total.
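
The arithmetic behind these figures can be checked directly. This is only a back-of-the-envelope sketch that uses the two timings quoted above and the 12 runs per dataset; the variable names are illustrative.

```python
# Timings quoted above (seconds) and the number of runs per dataset
gpu_seconds = 43
baseline_seconds = 2342
runs_per_dataset = 12

speedup = baseline_seconds / gpu_seconds
baseline_hours = baseline_seconds * runs_per_dataset / 3600
gpu_minutes = gpu_seconds * runs_per_dataset / 60

print(f"speedup: {speedup:.1f}x")                  # speedup: 54.5x
print(f"baseline total: {baseline_hours:.1f} h")   # baseline total: 7.8 h
print(f"GPU total: {gpu_minutes:.1f} min")         # GPU total: 8.6 min
```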

# Setup

In [None]:
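# GPU machine-learning stack (RAPIDS cuML for CUDA 12)
!pip install -q --extra-index-url=https://pypi.nvidia.com cuml-cu12
import cuml,cudf
print(cuml.__version__)

# Graphistry with its AI extras (UMAP, DBSCAN)
!pip install -q "graphistry[ai]"

# Biopython for FASTQ parsing
!pip install -q Biopython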

In [2]:
import locale
# Force the preferred encoding to UTF-8 so that shell commands keep working in Colab
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# Import / configure

Create a free Graphistry account at https://www.graphistry.com/ and store its username and password as Colab secrets named `g_user` and `g_pass`.


In [3]:
from google.colab import userdata
g_user=userdata.get('g_user')
g_pass=userdata.get('g_pass')

In [19]:
import numpy as np
import pandas as pd
import graphistry
from time import time

# Log in to Graphistry Hub with the username and password read from Colab secrets
graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username=g_user, password=g_pass)
graphistry.__version__

import cuml,cudf
print(cuml.__version__)

24.06.01


# bio-ml dataset


1.   [3 subjects x 10 time points](https://www.ebi.ac.uk/ena/browser/view/PRJNA544527)

2.   [metadata](https://static-content.springer.com/esm/art%3A10.1038%2Fs41591-019-0559-3/MediaObjects/41591_2019_559_MOESM3_ESM.xlsx)

3.   `!wget https://raw.githubusercontent.com/dcolinmorgan/grph/main/ftp_PRJNA544527.txt`


In [None]:
!unzip -o PRJNA544527_mpa4out.txt.zip

In [5]:
%%bash
# Download one example pair of raw reads only if the precomputed profile table is absent
if [ ! -f PRJNA544527_mpa4out.txt ]; then
    wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR922/006/SRR9224006/SRR9224006_1.fastq.gz
    wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR922/006/SRR9224006/SRR9224006_2.fastq.gz

    gunzip SRR9224006_1.fastq.gz
    gunzip SRR9224006_2.fastq.gz

    head /content/SRR9224006_1.fastq
fi

In [6]:
import os
if not os.path.exists('PRJNA544527_mpa4out.txt'):
    import glob
    import pandas as pd
    from Bio import SeqIO
    frames=[]
    for i in glob.glob('/content/*.fastq'):
        # Collect the read IDs and sequences of one FASTQ file
        identifiers = []
        sequences = []
        for record in SeqIO.parse(i,'fastq'):
            identifiers.append(record.id)
            sequences.append(str(record.seq))

        A=pd.DataFrame({'ID':identifiers,'seq':sequences})
        A.dropna(inplace=True)
        frames.append(A)
    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    B=pd.concat(frames) if frames else pd.DataFrame(columns=['ID','seq'])
    # Trim the read ID at '_length' and use it as the index
    B['ID']=B.ID.str.split('_length').str[0]
    B.index=B.ID

# Install [HUMAnN 3](https://huttenhower.sph.harvard.edu/humann)

HUMAnN 3 is a method for efficiently and accurately profiling the abundance of microbial metabolic pathways and other molecular functions from metagenomic or metatranscriptomic sequencing data.

### Running all samples takes a very long time (1+ day on a cluster)

In [7]:
%%bash

if [ ! -f PRJNA544527_mpa4out.txt ]; then

    pip install humann --no-binary :all:
    pip install metaphlan

    humann_databases --download utility_mapping full /path/to/databases --update-config yes

    # humann_test
    wget https://github.com/biobakery/humann/raw/master/examples/demo.fastq.gz
    humann -i demo.fastq.gz -o sample_results


    mkdir assemble epi_sam_out mpa4_out
    humann -i /content/All_MAGs/Sample_101_S75_bin_1.fa -o test_out


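    # Unique sample accessions taken from the FASTQ file names (e.g. SRR9224006_1.fastq -> SRR9224006)
    seq=$(ls /content/*.fastq | xargs -n 1 basename | cut -d _ -f1 | sort -u)

    for i in $seq
    do
        # Make sure the per-sample output directory exists before profiling
        mkdir -p /content/assemble/${i}
        metaphlan /content/${i}.fa --nproc 40 --input_type fasta -o /content/assemble/${i}/h4_out.txt -t rel_ab_w_read_stats
    done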
fi

# umap and dbscan

The idea for this metagenomic analysis is based on [Quantifying Shared and Unique Gene Content across 17 Microbial Ecosystems](https://journals.asm.org/doi/full/10.1128/msystems.00118-23).

(The analysis of all samples was run on a cluster.)

See also this [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x?ref=https://codemonkey.link#Sec7) on Mash and the [method](https://github.com/marbl/Mash/blob/master/INSTALL.txt) itself.
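
As background for the `dbscan=True` option used in the cells below, here is a minimal pure-Python sketch of density-based clustering. It is an illustration only: the notebook relies on library implementations through `g.umap(dbscan=True, ...)`, and the function name, the toy points, and the `eps` and `min_samples` values are all assumptions made for this example.

```python
from math import dist

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighbourhood of each point, including the point itself
    neighbours = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n              # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1           # provisional noise; may become a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])   # j is also a core point: keep expanding
    return labels

# Toy 2-D "embedding": two tight groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))   # [0, 0, 0, 1, 1, 1, -1]
```

Points in dense regions share a cluster label, while the isolated point is marked as noise (`-1`).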



In [8]:
# Load the taxonomic profile table: rows are clades, columns are samples (SRR accessions)
data=pd.read_csv('/content/PRJNA544527_mpa4out.txt',sep='\t',skiprows=1,index_col=0)
# Keep only the species field (7th level of the lineage); rows without one become NaN and are dropped
data.index=data.reset_index().clade_name.str.split('|',expand=True)[6]
data=data.reset_index().dropna(axis=0)
data.index=data[6]
data=data.drop(columns=6)

# Sample metadata: column 3 holds the run accession, column 5 the sample label
!wget https://raw.githubusercontent.com/dcolinmorgan/grph/main/PRJNA544527-meta_inf.txt
meta=pd.read_csv('/content/PRJNA544527-meta_inf.txt',sep='\t',header=None)

# Attach the sample label to each sample's species abundances
mm=pd.merge(data.T,meta[[3,5]],left_index=True,right_on=3)

# Split the label into donor ID and time point
mm['id']=mm[5].str.split('-').str[0]
mm['time']=mm[5].str.split('_').str[0].str.split('-').str[1]

# Donor-level metadata (age, sex, BMI) from the paper's supplementary table
!wget https://static-content.springer.com/esm/art%3A10.1038%2Fs41591-019-0559-3/MediaObjects/41591_2019_559_MOESM3_ESM.xlsx
metaa=pd.read_excel('/content/41591_2019_559_MOESM3_ESM.xlsx',sheet_name='SupTable2',skiprows=3)
metaa=metaa[['Donor','Age','Sex','BMI']]

Full_table=pd.merge(mm,metaa,left_on='id',right_on='Donor')
Full_table=Full_table.drop(columns=[3, 5, 'id'])

# Long format: one row per (sample, species) abundance value
data2=Full_table.melt(id_vars=['time','Donor','Age','Sex','BMI'])

data2=data2.rename(columns={'variable':'species'})
data2=data2.sort_values(by=['Donor','time','value'])



In [9]:
(data2.species)+'_'+(data2.Donor)

2678                s__Bacteroides_clarus_aa
5378          s__Bacteroides_intestinalis_aa
9158               s__Ruminococcus_bromii_aa
12938                  s__GGB6601_SGB9333_aa
13478                  s__GGB3256_SGB4303_aa
                         ...                
86343     s__Faecalibacterium_prausnitzii_dl
2103          s__Phocaeicola_massiliensis_dl
67983         s__Phocaeicola_massiliensis_dl
5883      s__Faecalibacterium_prausnitzii_dl
178143            s__Phocaeicola_plebeius_dl
Length: 208440, dtype: object

In [10]:
data2[data2['value']>1]

index,time,Donor,Age,Sex,BMI,species,value
39938,0154,aa,29,Male,24.1,s__Desulfovibrio_piger,1.00422
11318,0154,aa,29,Male,24.1,s__Odoribacter_splanchnicus,1.12785
77198,0154,aa,29,Male,24.1,s__Odoribacter_splanchnicus,1.12785
73418,0154,aa,29,Male,24.1,s__Faecalibacterium_prausnitzii,1.14483
183578,0154,aa,29,Male,24.1,s__GGB3304_SGB4367,1.24406
...,...,...,...,...,...,...,...
86343,0006,dl,32,Male,26.1,s__Faecalibacterium_prausnitzii,2.21002
2103,0006,dl,32,Male,26.1,s__Phocaeicola_massiliensis,3.84088
67983,0006,dl,32,Male,26.1,s__Phocaeicola_massiliensis,3.84088
5883,0006,dl,32,Male,26.1,s__Faecalibacterium_prausnitzii,4.37472
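
The `Donor` and `time` values in this table come from splitting the sample label in column 5 of the metadata. A minimal plain-Python sketch of that split is below; the label `'aa-0154_example'` is hypothetical, shaped only to match the donor and time values shown above, and `split_label` is an illustrative name.

```python
def split_label(label: str) -> tuple[str, str]:
    """Split a sample label into (donor, time point)."""
    donor = label.split('-')[0]                      # text before the first '-'
    time_point = label.split('_')[0].split('-')[1]   # between the '-' and the first '_'
    return donor, time_point

# Hypothetical label, not taken from the real metadata file
print(split_label('aa-0154_example'))   # ('aa', '0154')
```

Unlike the pandas `.str` accessors, which yield `NaN` for a label without a `-`, this plain-Python version raises an `IndexError` in that case.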


In [17]:
data

clade_name,SRR9224004,SRR9224005,SRR9224006,SRR9224007,SRR9224008,SRR9224009,SRR9224010,SRR9224011,SRR9224012,SRR9224013,...,SRR9224554,SRR9224555,SRR9224556,SRR9224557,SRR9224558,SRR9224560,SRR9224562,SRR9224563,SRR9224564,SRR9224566
k__Bacteria,100.00000,100.00000,100.00000,99.99325,99.99121,100.00000,100.00000,100.00000,100.00000,99.97112,...,99.96449,99.76391,99.03395,99.93944,100.00000,100.00000,100.00000,100.00000,100.00000,100.00000
k__Bacteria|p__Bacteroidetes,61.23730,44.02919,66.08642,80.31334,82.32195,76.81303,71.32438,79.69819,89.64043,75.12163,...,67.00723,61.02803,63.98730,67.10925,73.61819,50.41596,73.19007,69.42130,66.79971,63.91885
k__Bacteria|p__Firmicutes,35.28431,52.33158,29.94238,18.42032,17.14698,20.21463,26.38116,17.38671,8.61748,23.37514,...,28.01037,32.63114,29.99650,27.97017,22.53294,46.12990,24.15388,27.94324,30.31034,33.05760
k__Bacteria|p__Proteobacteria,2.94544,1.77397,3.50557,1.09038,0.43810,1.98629,1.88072,1.85886,1.39336,0.81315,...,1.66858,1.26552,2.61714,3.06149,2.72646,3.06176,2.24794,2.37410,2.53381,2.69303
k__Bacteria|p__Actinobacteria,0.40957,1.76681,0.46563,0.15281,0.07368,0.23584,0.39492,1.05624,0.34873,0.65842,...,1.41942,2.36773,1.64163,1.14980,0.80313,0.38644,0.33297,0.21942,0.35367,0.32526
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__GGB3167|s__GGB3167_SGB4181|t__SGB4181,0.00000,0.00000,0.00000,0.74127,0.26387,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Gemmiger|s__Gemmiger_formicilis|t__SGB15300,0.00000,0.00000,0.00000,0.66557,0.85196,0.00000,0.00000,0.00000,0.00000,0.05111,...,0.00000,0.00000,0.93250,2.05483,2.34251,0.00000,0.00000,0.00000,0.00000,0.00000
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Oscillibacter|s__Oscillibacter_sp_ER4|t__SGB15254,0.00000,0.00000,0.00000,0.59736,0.33949,0.00000,0.00000,0.00000,0.00000,0.00000,...,0.38830,0.21617,1.05106,0.82122,0.00000,0.00000,0.00000,0.00000,0.00000,0.00000
k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Clostridiaceae|g__Clostridium|s__Clostridium_sp_AF34_10BH|t__SGB4914,0.00000,0.00000,0.00000,0.54644,2.03641,0.00000,0.00000,0.01345,0.00000,0.14019,...,0.11002,0.11973,0.16318,1.17502,0.23408,0.03389,0.00000,0.00000,0.00000,0.00000
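
Each `clade_name` is a `|`-separated lineage, and the notebook keeps field 6 (the `s__` level) as the species name. The sketch below shows that extraction in plain Python on one lineage copied from the table above; `species_of` is an illustrative helper, not part of any library.

```python
def species_of(clade_name: str):
    """Return the species-level field (index 6) of a lineage string, or None."""
    fields = clade_name.split('|')
    return fields[6] if len(fields) > 6 else None

lineage = ('k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|'
           'f__Oscillospiraceae|g__GGB3167|s__GGB3167_SGB4181|t__SGB4181')
print(species_of(lineage))                       # s__GGB3167_SGB4181
print(species_of('k__Bacteria|p__Firmicutes'))   # None
```

A species-level row and its strain-level (`t__`) row both map to the same species name, which is probably why some species appear twice with identical values in the long-format table.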


## UMAP by species on CPU (`umap_learn`)


In [38]:
data=pd.read_csv('/content/PRJNA544527_mpa4out.txt',sep='\t',skiprows=1,index_col=0)

# Each clade is a node; its abundances across samples are the features
g = graphistry.nodes(cudf.from_pandas(data.dropna()))

# Time UMAP + DBSCAN with the CPU engine
t=time()
g3=g.umap(dbscan=True,engine='umap_learn')
print('\n Total ', np.round(time() - t,1), 'seconds passed')

# Rebuild a graph whose node positions are the 2-D UMAP coordinates
emb2=g3._node_embedding
g22=graphistry.nodes(emb2.reset_index(),'index').edges(g3._edges,'_src_implicit','_dst_implicit').bind(point_x="x",point_y="y").settings(url_params={"play":0})




 Total  24.8 seconds passed


## UMAP by species via GPU

In [28]:
data=pd.read_csv('/content/PRJNA544527_mpa4out.txt',sep='\t',skiprows=1,index_col=0)

# Same input as the CPU run: one node per clade, abundances as features
g = graphistry.nodes(cudf.from_pandas(data.dropna()))

# Time UMAP + DBSCAN with the GPU engine (cuML)
t=time()
g3=g.umap(dbscan=True,engine='cuml')
print('\n Total ', np.round(time() - t,1), 'seconds passed')

# Rebuild a graph whose node positions are the 2-D UMAP coordinates
emb2=g3._node_embedding
g22=graphistry.nodes(emb2.reset_index(),'index').edges(g3._edges,'_src_implicit','_dst_implicit').bind(point_x="x",point_y="y").settings(url_params={"play":0})




 Total  0.7 seconds passed
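
Both cells repeat the same `t=time()` and `print(...)` pattern. One way to make such comparisons less error-prone is a small timing context manager, sketched below with the standard library only. The name `timed` and the stand-in workload are assumptions for illustration; in the notebook the `with` body would be the `g.umap(...)` call.

```python
from contextlib import contextmanager
from time import perf_counter

@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the `with` block in results[label]."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start

timings = {}
with timed('demo', timings):
    sum(i * i for i in range(100_000))   # stand-in for g.umap(dbscan=True, engine=...)

print({name: round(seconds, 3) for name, seconds in timings.items()})
```

With the two totals printed above, 24.8 s on CPU against 0.7 s on GPU, the ratio for this cell is roughly 35x.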


In [29]:
data.index

Index(['k__Bacteria', 'k__Bacteria|p__Bacteroidetes',
       'k__Bacteria|p__Firmicutes', 'k__Bacteria|p__Proteobacteria',
       'k__Bacteria|p__Actinobacteria', 'k__Bacteria|p__Lentisphaerae',
       'k__Bacteria|p__Bacteria_unclassified',
       'k__Bacteria|p__Bacteroidetes|c__Bacteroidia',
       'k__Bacteria|p__Firmicutes|c__Clostridia',
       'k__Bacteria|p__Proteobacteria|c__Betaproteobacteria',
       ...
       'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__GGB9719|s__GGB9719_SGB15273',
       'k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_stercoris|t__SGB1830',
       'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Clostridiaceae|g__GGB3175|s__GGB3175_SGB4191|t__SGB4191',
       'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15326',
       'k__Bacteria|p__Bacteroidetes|c__Bacter

In [30]:
# Species name (7th lineage field) of each embedded node, in node order
A=emb2.reset_index()['clade_name'].to_pandas().str.split('|').str[6]
emb2.index=A

In [31]:
# Relabel the edge endpoints from node positions to species names
B=g3._edges
B['_src_implicit'] = B['_src_implicit'].replace(A, regex=True)
B['_dst_implicit'] = B['_dst_implicit'].replace(A, regex=True)

Note: the edge endpoints are relabelled with the regex replacement above rather than with a merge.

In [32]:
g22=graphistry.nodes(emb2.reset_index(),'clade_name').edges(g3._edges.dropna(),'_src_implicit','_dst_implicit').bind(point_x="x",point_y="y").settings(url_params={"play":0})

g22.plot()

## UMAP of patients by time point

In [34]:
# Keep non-zero abundances and drop exact duplicate rows
data2=data2[data2.value>0]
data2=data2.reset_index(drop = True)
data2=data2.drop_duplicates()

# Number each donor's time points 1, 2, 3, ... in order of appearance
data2["Label"] = data2.groupby("Donor")["time"].transform(lambda s: pd.factorize(s)[0] + 1)

# Keep only the first 4 time points of each donor
data2=data2[data2.Label<5]

# Within each donor, keep the rows whose abundance is among the 9 highest distinct values
data2["rank"] = data2.groupby("Donor")["value"].rank(method="dense", ascending=False)
data2=data2[data2['rank']<10.0]
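
The dense ranking in the last two lines is easy to misread, so here is a plain-Python sketch of what `rank(method="dense", ascending=False)` computes for a single donor. The abundance values are made up and `dense_rank_desc` is an illustrative helper.

```python
def dense_rank_desc(values):
    """Dense rank with the largest value ranked 1; ties share a rank, with no gaps."""
    order = {v: r for r, v in enumerate(sorted(set(values), reverse=True), start=1)}
    return [order[v] for v in values]

abundances = [4.37, 3.84, 3.84, 2.21, 0.5]   # made-up values for one donor
ranks = dense_rank_desc(abundances)
print(ranks)                                  # [1, 2, 2, 3, 4]

# Analogue of `rank < 10`, with a cutoff of 3 to suit the tiny example
kept = [v for v, r in zip(abundances, ranks) if r < 3]
print(kept)                                   # [4.37, 3.84, 3.84]
```

Because tied values share a rank, the filter can keep more rows than the cutoff suggests.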


In [35]:
data2['id_time']=data2['Donor']+'_'+data2['Label'].apply(str)

In [36]:
data3=data2[['id_time','species','value']]

In [37]:
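# Wide format: one row per donor/time point, one column per species (mean abundance)
df2 = data3.pivot_table(index=['id_time'],columns='species')
# Species absent from a sample get an abundance of 0
df3=df2.fillna(0).reset_index()
# Drop the outer 'value' column level, index by id_time, and keep only species columns
df4=df3.droplevel(0, axis=1)
df4.index=df4.iloc[:,0]
df4=df4.loc[:, df4.columns.str.startswith('s__')]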

g = graphistry.nodes(cudf.from_pandas(df4))

t=time()

g3=g.umap(dbscan=True,engine='cuml')
print('\n Total ', np.round(time() - t,1), 'seconds passed')

emb2=g3._node_embedding

A=emb2.reset_index().iloc[:,0].to_pandas()
emb2.index=A

B=g3._edges
B['_src_implicit'] = B['_src_implicit'].replace(A, regex=True)
B['_dst_implicit'] = B['_dst_implicit'].replace(A, regex=True)

emb2['id_time']=emb2.index
g22=graphistry.nodes(emb2.reset_index(),'id_time').edges(g3._edges.dropna(),'_src_implicit','_dst_implicit').bind(point_x="x",point_y="y").settings(url_params={"play":0})

g22.plot()




 Total  0.3 seconds passed
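
The feature matrix fed to this last UMAP run is a long-to-wide pivot: one row per `id_time`, one column per species, with missing combinations filled with 0. The sketch below reproduces that reshaping in plain Python, averaging duplicates in the same way as the `pivot_table` default. The rows are invented and `pivot_mean` is an illustrative name.

```python
from collections import defaultdict

def pivot_mean(rows):
    """rows: (id_time, species, value) tuples -> (species columns, {id_time: [values]})."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for id_time, species, value in rows:
        sums[(id_time, species)] += value
        counts[(id_time, species)] += 1
    species_cols = sorted({s for _, s in sums})
    ids = sorted({i for i, _ in sums})
    table = {
        i: [sums[(i, s)] / counts[(i, s)] if (i, s) in counts else 0.0
            for s in species_cols]
        for i in ids
    }
    return species_cols, table

# Invented long-format rows: (id_time, species, abundance)
rows = [('aa_1', 's__A', 2.0), ('aa_1', 's__B', 1.0),
        ('aa_2', 's__A', 4.0), ('aa_2', 's__A', 6.0)]
cols, table = pivot_mean(rows)
print(cols)    # ['s__A', 's__B']
print(table)   # {'aa_1': [2.0, 1.0], 'aa_2': [5.0, 0.0]}
```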
