# Find the closest reference genome used by PICRUSt2 for functional profile prediction
PICRUSt2 predicts functional profiles by Hidden State Prediction (HSP): each ASV is placed into a phylogeny of reference genomes, and its traits are inferred from that placement. The HSP step is carried out with the castor R package.
![HSP](https://pages.uoregon.edu/slouca/LoucaLab/SECTION_Software/MODULE_Software/CLASS_Software/UNIT_castor/Hidden%20state%20prediction%20-%20schematic.jpg) 

*Hidden state prediction - [From Louca Lab](http://www.loucalab.com/lib/php/index.php?section=Software&page=../../SECTION_Software/MODULE_Software/CLASS_Software/UNIT_castor/page.php)*

The goal here is to find the closest reference genome used in the HSP step. In other words: given a sequence (ASV), which reference genome accession is its nearest neighbour in the tree?

From the [Castor docs](https://rdrr.io/cran/castor/man/find_nearest_tips.html):
> Langille et al. (2013) introduced the Nearest Sequenced Taxon Index (NSTI) as a measure for how well a set of microbial operational taxonomic units (OTUs) is represented by a set of sequenced genomes of related organisms. Specifically, the NSTI of a microbial community is the average phylogenetic distance of any OTU in the community, to the closest relative with an available sequenced genome ("target tips"). In analogy to the NSTI, the function find_nearest_tips provides a means to find the nearest tip (from a subset of target tips) to each tip and node in a phylogenetic tree, together with the corresponding phylogenetic ("patristic") distance.

The same question was raised as a PICRUSt2 issue [here](https://github.com/picrust/picrust2/issues/137).
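
To make the NSTI definition quoted above concrete, here is a minimal sketch in plain Python. The ASV names, distances and read counts are made up for illustration. The abundance weighting is my reading of how a community-level NSTI is usually reported and should be treated as an assumption; nothing here is computed by PICRUSt2 or castor.

```python
# Toy data (hypothetical): distance from each ASV to its nearest sequenced genome
nearest_distance = {"asv_a": 0.02, "asv_b": 0.10, "asv_c": 0.45}
# Hypothetical read counts of the same ASVs in one sample
abundance = {"asv_a": 70, "asv_b": 20, "asv_c": 10}


def mean_nsti(distances):
    """Unweighted mean distance to the nearest sequenced genome."""
    return sum(distances.values()) / len(distances)


def weighted_nsti(distances, counts):
    """Abundance-weighted mean distance (assumed community-level NSTI)."""
    total = sum(counts[asv] for asv in distances)
    return sum(distances[asv] * counts[asv] for asv in distances) / total


print(round(mean_nsti(nearest_distance), 4))
print(round(weighted_nsti(nearest_distance, abundance), 4))
```

The weighted value is pulled towards the abundant, well-represented `asv_a`, which is why a sample can have a low NSTI even when a few rare ASVs are far from any reference.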

In [1]:
# Required environment: ../environments/picrust2.yml
# Background: https://github.com/picrust/picrust2/issues/137
# Load libraries
import rpy2  # bridge between Python and R
from rpy2.robjects.packages import importr

import pandas as pd

import xml.etree.ElementTree as ET

# Enable the %%R cell magic so that whole cells can run R code
%load_ext rpy2.ipython

In [2]:
# Load the per-ASV NSTI values from the PICRUSt2 output
# Note: error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there
df_nsti = pd.read_csv('../data/picrust2/02-marker_predicted_and_nsti.tsv.gz', compression='gzip', header=0, sep='\t', quotechar='"', error_bad_lines=False)
target = df_nsti.sequence.values  # ASV IDs, passed to castor as target tips
df_nsti

| | sequence | 16S_rRNA_Count | metadata_NSTI |
|---|---|---|---|
| 0 | 0003670ded0674981fb6ace93921cf9d | 1 | 0.143177 |
| 1 | 0003c2463b98a6c28926dec4040423f0 | 1 | 0.064968 |
| 2 | 0004e798b8a4a52c71075db3de6b86a4 | 3 | 0.035405 |
| 3 | 0009e05d156dac5a1094ab2b0e532970 | 3 | 0.445430 |
| 4 | 0016aceac264cb884cf1188da48f7207 | 2 | 0.143107 |
| ... | ... | ... | ... |
| 8815 | ffc40a7b5a34f75614720e13eb7e735c | 1 | 0.117328 |
| 8816 | ffc4c1468e8808c290dbb9d0e0f6792f | 1 | 0.016047 |
| 8817 | ffd9aad6b4181dbf96e3e4cbce0d95a7 | 1 | 0.199763 |
| 8818 | ffef5b489ec0e9768fc90d508f8c85b2 | 1 | 0.065894 |
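
Before going further it can help to flag ASVs whose NSTI is too high to trust. The sketch below uses a made-up three-row stand-in for the PICRUSt2 table, so it runs without the real file. The cut-off of 2.0 is an assumption on my side (adjust it to your own analysis), and `on_bad_lines` needs pandas 1.3 or newer.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for 02-marker_predicted_and_nsti.tsv.gz
toy_tsv = (
    "sequence\t16S_rRNA_Count\tmetadata_NSTI\n"
    "asv_a\t1\t0.143\n"
    "asv_b\t3\t0.035\n"
    "asv_c\t2\t2.500\n"
)

# on_bad_lines="skip" is the replacement for error_bad_lines=False
nsti = pd.read_csv(io.StringIO(toy_tsv), sep="\t", on_bad_lines="skip")

MAX_NSTI = 2.0  # assumed cut-off, not taken from this notebook
nsti["poorly_represented"] = nsti["metadata_NSTI"] > MAX_NSTI
flagged = nsti.loc[nsti["poorly_represented"], "sequence"].tolist()
print(flagged)
```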


In [3]:
# Read the tree through rpy2 only to get the tip labels in tree order
castor = importr('castor')  # import the R package
tree = castor.read_tree(file="../data/picrust2/01-out.tre")  # read the tree
index = list(tree.rx('tip.label')[0])  # tip labels, in the order castor indexes them
# -p: do not fail if the directory already exists
! mkdir -p ../data/castor


In [4]:
%%R -i target
# Main step: for every tip, find the nearest target tip (here the ASVs) and its patristic distance
library(castor)
tree = read_tree(file="../data/picrust2/01-out.tre")
results = find_nearest_tips(tree, target_tips=target, only_descending_tips=FALSE)
write.csv(results$nearest_tip_per_tip,"../data/castor/nearest_tip_per_tip.csv")
write.csv(results$nearest_distance_per_tip,"../data/castor/nearest_distance_per_tip.csv")
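
One thing worth keeping in mind about the call above: the ASVs are passed as `target_tips`, so the result gives the nearest ASV for every reference tip, not the nearest reference for every ASV. The toy example below is a pure-Python reimplementation of the idea (a multi-source shortest-path search, not castor's actual algorithm) on a made-up six-edge tree, and shows that the two directions can disagree. All names and branch lengths are hypothetical.

```python
import heapq
from collections import defaultdict

# Toy tree (hypothetical): edges as (parent, child, branch_length)
edges = [
    ("root", "n1", 0.1), ("root", "ref_2", 0.9),
    ("n1", "asv_a", 0.2), ("n1", "n2", 0.1),
    ("n2", "ref_1", 0.05), ("n2", "asv_b", 0.5),
]


def nearest_target(edges, targets):
    """Return {node: (nearest target tip, patristic distance)} for all nodes."""
    graph = defaultdict(list)
    for parent, child, length in edges:
        graph[parent].append((child, length))
        graph[child].append((parent, length))
    best = {}
    # Start the search from all target tips at once (multi-source Dijkstra)
    heap = [(0.0, tip, tip) for tip in targets]
    heapq.heapify(heap)
    while heap:
        dist, node, source = heapq.heappop(heap)
        if node in best:
            continue
        best[node] = (source, dist)
        for neighbour, length in graph[node]:
            if neighbour not in best:
                heapq.heappush(heap, (dist + length, neighbour, source))
    return best


refs = ["ref_1", "ref_2"]
asvs = ["asv_a", "asv_b"]

by_ref = nearest_target(edges, refs)  # nearest reference for each ASV
by_asv = nearest_target(edges, asvs)  # nearest ASV for each reference

for asv in asvs:
    print(asv, "->", by_ref[asv][0], round(by_ref[asv][1], 3))
for ref in refs:
    print(ref, "->", by_asv[ref][0], round(by_asv[ref][1], 3))
```

With references as targets, both ASVs get an answer (`ref_1` in each case). With ASVs as targets, both references point to `asv_a`, so `asv_b` never shows up when the table is inverted later. If the nearest reference per ASV is what you are after, passing the reference tip labels as `target_tips` to `find_nearest_tips` may be the more direct route.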

In [5]:
# Read the castor results back into pandas; the R row index is dropped so rows line up with `index`
df_distance = pd.read_csv('../data/castor/nearest_distance_per_tip.csv', index_col=0).rename(columns={'x':'distance'})
df_distance = df_distance.reset_index(drop=True)
df_tip = pd.read_csv('../data/castor/nearest_tip_per_tip.csv', index_col=0).rename(columns={'x':'tip'})
df_tip = df_tip.reset_index(drop=True)
df_otu = pd.DataFrame(data={'OTU':index})

In [6]:
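# Combine tip labels, nearest-tip indices and distances into one table
df = pd.DataFrame()
df['OTU'] = df_otu['OTU']
df['tip'] = df_tip['tip']
# castor returns 1-based tip indices, so subtract 1 to look up the label
df['tip_name'] = [df.loc[i-1, 'OTU'] for i in df['tip']]
df['distance'] = df_distance['distance']
df
# Every ASV is itself a target tip, so it is its own nearest tip at distance 0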

| | OTU | tip | tip_name | distance |
|---|---|---|---|---|
| 0 | 59a5ae0bd557c087d0c9525fb4949e27 | 1 | 59a5ae0bd557c087d0c9525fb4949e27 | 0.000000 |
| 1 | ff01e0ac3648b995aa934b592e701f2f | 2 | ff01e0ac3648b995aa934b592e701f2f | 0.000000 |
| 2 | 6be58073516616141a67b30dead138e2 | 3 | 6be58073516616141a67b30dead138e2 | 0.000000 |
| 3 | 5680fc3fa41ecd524ac2cfdd955b2b21 | 4 | 5680fc3fa41ecd524ac2cfdd955b2b21 | 0.000000 |
| 4 | 6c6aa959abbcfc41570d48aa4ede1c56 | 5 | 6c6aa959abbcfc41570d48aa4ede1c56 | 0.000000 |
| ... | ... | ... | ... | ... |
| 28815 | 646311953 | 28813 | 701c9c9d844cff83f23eabd4f2356ffb | 0.161337 |
| 28816 | 3766935d7678d6874652897cd0c675d4 | 28817 | 3766935d7678d6874652897cd0c675d4 | 0.000000 |
| 28817 | 2574180439 | 28817 | 3766935d7678d6874652897cd0c675d4 | 0.298399 |
| 28818 | 2263196004 | 28817 | 3766935d7678d6874652897cd0c675d4 | 0.328516 |
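
The `tip` column holds 1-based indices because it was written by R, while pandas and NumPy count from 0. As a small illustration of the lookup done above, here is a vectorised version on made-up labels (the tip order and indices are hypothetical):

```python
import numpy as np

tip_labels = np.array(["asv_a", "ref_1", "asv_b", "ref_2"])  # hypothetical tip order
nearest_tip = np.array([1, 1, 3, 3])  # 1-based indices, as written by R

# Subtract 1 to turn R's 1-based indices into 0-based positions
nearest_label = tip_labels[nearest_tip - 1]
print(nearest_label.tolist())
```

Forgetting the `- 1` would silently shift every label by one tip, so it is worth checking that the ASV rows map to themselves, as they do in the table above.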


In [7]:
# Drop all rows at distance 0: these are the ASVs matching themselves
df_zero = df[df['distance'] == 0.0]
# df_zero['OTU'] == df_zero['tip_name'] # sanity check
df_not_zero = df[df['distance'] != 0.0]
df = df_not_zero
df  # what remains is the roughly 20,000 reference tips, each paired with its nearest ASV

| | OTU | tip | tip_name | distance |
|---|---|---|---|---|
| 36 | 2619618878 | 35 | 50c512eb1a57126dd981a326ca388927 | 0.186186 |
| 43 | 2626542099 | 43 | d685b6e97675282789c56d223667fd9a | 0.165598 |
| 80 | 2654588091 | 66 | e11a82bd3d75b89e208fe1f96689137c | 1.032722 |
| 81 | 2713897351-cluster | 66 | e11a82bd3d75b89e208fe1f96689137c | 1.159522 |
| 82 | 2708742863 | 66 | e11a82bd3d75b89e208fe1f96689137c | 1.170920 |
| ... | ... | ... | ... | ... |
| 28814 | 2519899621 | 28814 | 0e72582aa1503fa96d3203b9d836df81 | 0.190038 |
| 28815 | 646311953 | 28813 | 701c9c9d844cff83f23eabd4f2356ffb | 0.161337 |
| 28817 | 2574180439 | 28817 | 3766935d7678d6874652897cd0c675d4 | 0.298399 |
| 28818 | 2263196004 | 28817 | 3766935d7678d6874652897cd0c675d4 | 0.328516 |
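
Filtering on `distance != 0` works as long as no reference tip sits at exactly distance 0 from an ASV. I have not checked whether that happens in this tree, so the sketch below is only a precaution: it separates references from ASVs by membership in the ASV list instead. The four rows are made up.

```python
import pandas as pd

asv_ids = {"asv_a", "asv_b"}  # hypothetical study sequences
toy = pd.DataFrame({
    "OTU": ["asv_a", "asv_b", "ref_1", "ref_2"],
    "tip_name": ["asv_a", "asv_b", "asv_a", "asv_b"],
    "distance": [0.0, 0.0, 0.0, 0.3],  # ref_1 is at distance 0 from asv_a
})

by_distance = toy[toy["distance"] != 0.0]
by_membership = toy[~toy["OTU"].isin(asv_ids)]

print(by_distance["OTU"].tolist())
print(by_membership["OTU"].tolist())
```

The distance filter loses `ref_1`, the membership filter keeps it. In the notebook the equivalent would be to test `df['OTU']` against `target`.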


In [8]:
# Main question: which reference genomes point to each ASV?
# Group the reference tips by their nearest ASV and split the ASVs into those
# hit by exactly one reference (singleton) and those hit by several (multiple).
singleton = []
multiple = []
for asv, group in df.groupby('tip_name', sort=False):
    row = [len(group), asv, group['OTU'].tolist(), group['distance'].tolist()]
    if len(group) == 1:
        singleton.append(row)
    else:
        multiple.append(row)
print(df['tip_name'].nunique(), len(singleton), len(multiple))
# ASVs matched by at least one reference, then how many have a single match and how many have several

929 368 561


In [9]:
df_multi = pd.DataFrame(data=multiple)
df_multi = df_multi.rename(columns={0:'n', 1:'ASV', 2:'matches', 3:'distances'})
df_multi

| | n | ASV | matches | distances |
|---|---|---|---|---|
| 0 | 6 | e11a82bd3d75b89e208fe1f96689137c | [2654588091, 2713897351-cluster, 2708742863, 2... | [1.032722, 1.159522, 1.1709200000000002, 1.147... |
| 1 | 39 | 33387b3b082c1cb40d019effa29e98c6 | [2504756006, 646311936, 650377953, 2698536703,... | [0.9547479999999999, 1.4589, 1.455725, 1.45001... |
| 2 | 369 | 830361b7928ed01e2f494ed42d26933f | [2706794758, 2645728126-cluster, 2619618955, 2... | [1.40737, 1.350438, 1.354975, 1.658036, 1.6537... |
| 3 | 5 | 95eb0399e63cec1668e1820fc07f7e4e | [2639762582-cluster, 2264867087, 2527291514, 2... | [1.320298, 1.320299, 1.333949, 1.317051, 1.317... |
| 4 | 61 | 386651ac8081ec8d53c94a2d501c3251 | [2517487012, 2599185146, 647000209, 2509601011... | [1.129001, 1.129002, 1.129002, 1.0930229999999... |
| ... | ... | ... | ... | ... |
| 556 | 2 | 6888203136228be398e65a0c4e97e4f2 | [2706794988-cluster, 2703719286] | [0.038117000000000005, 0.03726] |
| 557 | 2 | 56a19686efe219e3a0b906cb873b6329 | [2602041625, 2728369433] | [0.727963, 0.520324] |
| 558 | 21 | c0d3cd6b0d628a228e12159737786496 | [2619618830, 648028022, 646311919, 2554235449,... | [0.561346, 0.582526, 0.569747, 0.571593, 0.569... |
| 559 | 3 | 31a3d896ce185925e027d301fa29c074 | [2540341085, 2513020048, 2514752028] | [0.364526, 0.364526, 0.340221] |


In [10]:
df_single = pd.DataFrame(data=singleton)
df_single = df_single.rename(columns={0:'n', 1:'ASV', 2:'matches', 3:'distances'})
df_single

| | n | ASV | matches | distances |
|---|---|---|---|---|
| 0 | 1 | 50c512eb1a57126dd981a326ca388927 | [2619618878] | [0.186186] |
| 1 | 1 | d685b6e97675282789c56d223667fd9a | [2626542099] | [0.165598] |
| 2 | 1 | 8d5fad88b0dd620c7bfb85cea2923532 | [2264813001-cluster] | [0.5025390000000001] |
| 3 | 1 | 8fb638fe966feb6b06aff0da107f0868 | [2619618882] | [0.193499] |
| 4 | 1 | b4e5724067cd09ae61b2b900684cb0f6 | [2751185654] | [0.090282] |
| ... | ... | ... | ... | ... |
| 363 | 1 | bb2de46728b3c66eb58647b090b9b5cd | [2508501111] | [0.107378] |
| 364 | 1 | 3231841d647e9a665499853d7e36d926 | [2615840657] | [0.058460000000000005] |
| 365 | 1 | a61ecbf933785ed6de23582a28efbbc2 | [646311962] | [0.47163900000000003] |
| 366 | 1 | 0e72582aa1503fa96d3203b9d836df81 | [2519899621] | [0.190038] |


In [11]:
#df_nsti_single.sort_values(by='sequence')

In [12]:
#df_single.sort_values(by='ASV')

In [13]:
df_nsti_multi = df_nsti[df_nsti['sequence'].isin(df_multi['ASV'].values)]
df_nsti_single = df_nsti[df_nsti['sequence'].isin(df_single['ASV'].values)]
#df_nsti_multi.sort_values(by='sequence')

In [14]:
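# For each ASV with several candidate references, keep the closest one
for i in df_multi.index:
    dist = df_multi.loc[i, 'distances']
    min_value = min(dist)
    min_index = dist.index(min_value)  # position of the smallest distance
    jgi = df_multi.loc[i, 'matches'][min_index]
    df_multi.loc[i, 'jgi_id'] = jgi  # reference genome ID (tip label)
    df_multi.loc[i, 'NSTI'] = min_value  # distance to that reference
#df_multi.sort_values(by='ASV')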

In [15]:
# Combine both cases into one table: ASV, nearest reference genome, distance
df_clean = df_multi[['ASV', 'jgi_id', 'NSTI']].copy()
df_single_clean = pd.DataFrame({
    'ASV': df_single['ASV'],
    'jgi_id': [i[0] for i in df_single['matches']],
    'NSTI': [i[0] for i in df_single['distances']],
})
# Append the single-match ASVs after the multi-match ones
df_clean = pd.concat([df_clean, df_single_clean], ignore_index=True)
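
For what it is worth, the singleton/multiple bookkeeping above boils down to "for each ASV, keep the reference with the smallest distance". A shorter way to get the same kind of table is `groupby` plus `idxmin`. The sketch below runs on four made-up rows with the same column names as `df`; it is an alternative, not what produced the outputs in this notebook, and the row order differs (groups come out sorted by ASV).

```python
import pandas as pd

# Hypothetical rows shaped like `df` after the zero distances were dropped
toy = pd.DataFrame({
    "OTU": ["ref_1", "ref_2", "ref_3", "ref_4"],
    "tip_name": ["asv_a", "asv_a", "asv_b", "asv_a"],
    "distance": [0.40, 0.15, 0.30, 0.22],
})

# Row label of the smallest distance within each ASV group
closest_rows = toy.groupby("tip_name")["distance"].idxmin()

closest = (
    toy.loc[closest_rows, ["tip_name", "OTU", "distance"]]
    .rename(columns={"tip_name": "ASV", "OTU": "jgi_id", "distance": "NSTI"})
    .reset_index(drop=True)
)
print(closest)
```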

In [16]:
# Keep only the PICRUSt2 NSTI rows for ASVs that have a nearest reference in df_clean
df_nsti_clean = df_nsti[df_nsti['sequence'].isin(df_clean['ASV'].values)]

In [17]:
df_nsti_clean

| | sequence | 16S_rRNA_Count | metadata_NSTI |
|---|---|---|---|
| 23 | 00af243dd1c5bf2e740c310a4e9322ab | 1 | 0.123784 |
| 34 | 00fd9e4b290fdd8149a809128deabdd3 | 1 | 0.330158 |
| 46 | 0150ccc6dc3013e388fc7ed636192f4f | 1 | 0.030699 |
| 110 | 0357ea7cf4c91a22f3d462f9de07b9c4 | 1 | 0.157182 |
| 113 | 0361b43590cd1db7729d58fa83dedc92 | 1 | 0.015704 |
| ... | ... | ... | ... |
| 8773 | fe99b49d7d039479db34d75fe233aaf3 | 1 | 0.208071 |
| 8789 | ff3051226075855de923319c46777425 | 1 | 0.395589 |
| 8803 | ff6aa0a5c3c9141e3189ba956bc87112 | 3 | 0.121629 |
| 8805 | ff7ff7b7d97ef375cbd029ca6581c167 | 1 | 0.063296 |


In [18]:
df_clean

| | ASV | jgi_id | NSTI |
|---|---|---|---|
| 0 | e11a82bd3d75b89e208fe1f96689137c | 2713897374 | 0.948647 |
| 1 | 33387b3b082c1cb40d019effa29e98c6 | 2728369449 | 0.641192 |
| 2 | 830361b7928ed01e2f494ed42d26933f | 2645728126-cluster | 1.350438 |
| 3 | 95eb0399e63cec1668e1820fc07f7e4e | 2264867079 | 1.317051 |
| 4 | 386651ac8081ec8d53c94a2d501c3251 | 643348543 | 0.933108 |
| ... | ... | ... | ... |
| 924 | bb2de46728b3c66eb58647b090b9b5cd | 2508501111 | 0.107378 |
| 925 | 3231841d647e9a665499853d7e36d926 | 2615840657 | 0.058460 |
| 926 | a61ecbf933785ed6de23582a28efbbc2 | 646311962 | 0.471639 |
| 927 | 0e72582aa1503fa96d3203b9d836df81 | 2519899621 | 0.190038 |


In [2]:
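# Put PICRUSt2's NSTI next to the castor distance and attach the taxonomy.
# Needs df_nsti_clean and df_clean from the cells above; rerun those first after a kernel restart.
df_compare = df_nsti_clean.merge(df_clean, left_on='sequence', right_on='ASV')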
df_compare = df_compare.drop(columns=['ASV'])
df_tax = pd.read_csv('../data/qiime2/filtered/exported-feature-table/taxonomy.tsv', sep='\t')
df_tax = df_tax[df_tax['Feature ID'].isin(df_compare['sequence'].values)]
df_compare = df_compare.merge(df_tax, left_on='sequence', right_on='Feature ID')
df_compare = df_compare.drop(columns=['Feature ID'])
df_compare.to_csv('../tables/nsti.csv')

NameError: name 'df_nsti_clean' is not defined
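
The `NameError` above just means the cell was run in a fresh kernel, before the earlier cells. Independent of that, the merge itself can be made a bit more defensive. The sketch below uses made-up frames with the same column names; `validate` and `indicator` are regular `DataFrame.merge` parameters, while the `nsti_difference` column is my own addition for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for df_nsti and df_clean
nsti = pd.DataFrame({
    "sequence": ["asv_a", "asv_b", "asv_c"],
    "metadata_NSTI": [0.12, 0.33, 0.03],
})
closest = pd.DataFrame({
    "ASV": ["asv_a", "asv_b"],
    "jgi_id": ["ref_2", "ref_3"],
    "NSTI": [0.15, 0.30],
})

merged = nsti.merge(
    closest,
    left_on="sequence",
    right_on="ASV",
    how="left",              # keep ASVs without a match so they stay visible
    validate="one_to_one",   # raise if an ASV appears twice on either side
    indicator=True,          # adds a _merge column: both / left_only
)

# Difference between PICRUSt2's NSTI and the distance found here
merged["nsti_difference"] = merged["metadata_NSTI"] - merged["NSTI"]

unmatched = merged.loc[merged["_merge"] == "left_only", "sequence"].tolist()
print(unmatched)
```

A left join with `indicator=True` makes it obvious how many ASVs never received a reference, which an inner join would hide.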

In [1]:
df_compare.head(5)

NameError: name 'df_compare' is not defined

In [1]:
# INFO https://figshare.com/articles/dataset/JGI_PICRUSt_genomes_tar_bz2/12233192/2
! wget -O ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2 https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/22494503/JGI_PICRUSt_genomes.tar.bz2

--2021-04-18 05:48:52--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/22494503/JGI_PICRUSt_genomes.tar.bz2
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.96.226
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.96.226|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17167485379 (16G) [application/octet-stream]
Saving to: ‘../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2’


2021-04-18 05:57:33 (31.4 MB/s) - ‘../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2’ saved [17167485379/17167485379]



In [24]:
# Does not work: samtools faidx needs a plain or bgzip-compressed FASTA, not a tar.bz2 archive
! samtools faidx ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2 2040502012

[E::fai_build_core] Different line length in sequence '(null)'
Could not load fai index of ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2


In [3]:
# Stream the FASTA out of the archive and print from the matching header up to the next header
! tar -xOf ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2 JGI_PICRUSt_genomes.fasta \
        | sed -n -e '/2040502012/,/>/ p' > ../data/jgi-picrust-genomes/head2.txt

^C


In [10]:
# Peek at the first header; head must read from the pipe, so it takes no file name
! tar -xOf ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2 JGI_PICRUSt_genomes.fasta \
        | head -n 1


In [19]:
# --to-command pipes each archive member to the command's stdin, so sed needs no file argument
! tar -x -f ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2 \
       --to-command="sed -n -e '/2040502012/,/>/ p'"


In [27]:
! tar -tvf ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2

-rw-r--r-- robynwright/wheel 176 2020-05-01 00:48 ._JGI_PICRUSt_genomes.fasta
-rw-r--r-- robynwright/wheel 61047459503 2020-05-01 00:48 JGI_PICRUSt_genomes.fasta
^C


In [12]:
! bzcat ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2 | wc -c

^C


In [6]:
# Sample 100 records; seqtk reads the stream from stdin when the file name is "-"
! tar -xOf ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2 JGI_PICRUSt_genomes.fasta | seqtk sample -s100 - 100 > ../data/jgi-picrust-genomes/sub1.fq
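
Drawing a fixed number of records from a stream of unknown length is a job for reservoir sampling. The sketch below shows the idea in plain Python on twenty made-up records; I have not checked that seqtk uses exactly this algorithm internally, so read it as an illustration of the technique rather than a description of seqtk.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records, k, seed):
    """Keep a uniform random sample of k records from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for n, record in enumerate(records):
        if n < k:
            sample.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = record  # replace with probability k / (n + 1)
    return sample


# Twenty hypothetical records, two lines each
toy_lines = []
for i in range(20):
    toy_lines += [f">genome_{i}", "ACGT" * (i + 1)]

picked = reservoir_sample(read_fasta(toy_lines), k=5, seed=100)
print(len(picked))
```

Only `k` records are held in memory at any time, which matters when the input is a 61 GB FASTA.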


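Downloading individual genomes directly from JGI was not an option either, because their systems were under maintenance at the time: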
> On April 11-16, 2021 the JGI computer systems will be undergoing maintenance and access to certain files and tools will be affected. File restore tape requests will be serviced until after the maintenance is complete. Sorry for the inconvenience.

> We’re soliciting feedback from JGI primary and data users on JGI Data Release and Utilization policies. Fill out our Request for Information by April 21.

In [None]:
# Fall back to unpacking the whole archive (JGI_PICRUSt_genomes.fasta is about 61 GB uncompressed)
! tar -xf ../data/jgi-picrust-genomes/JGI_PICRUSt_genomes.tar.bz2