The source of the `paper_metadata.csv` file was downloaded as an `.xlsx` file from [here](https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-019-10092-5/MediaObjects/41467_2019_10092_MOESM5_ESM.xlsx).  
It was converted to CSV, the first two lines were removed, and the values in `Sequencing ID` were used to fill the `ENA accession` column where it was missing.  
The values in `ENA accession` for the Ashton study were used to download the sequencing data with the downloading-tools workflow.  
The `reads_table.csv` file was created by the downloading-tools workflow.  
The run ERR2624135 could not be downloaded.  
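
The fill step described above could be reproduced along these lines. This is a minimal sketch on a made-up table, not the commands that were actually used: the helper name `fill_missing_accessions` and the toy values are assumptions, and only the column names `Sequencing ID` and `ENA accession` come from the metadata file.

```python
import pandas as pd


def fill_missing_accessions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where empty 'ENA accession' cells take the 'Sequencing ID' value."""
    out = df.copy()
    # fillna with a Series aligns on the index, so each row is filled from its own ID
    out['ENA accession'] = out['ENA accession'].fillna(out['Sequencing ID'])
    return out


# Toy rows, made up for illustration only
toy = pd.DataFrame({
    'Sequencing ID': ['ERR0000001', 'strain_A'],
    'ENA accession': [None, 'ERR0000002'],
})
filled = fill_missing_accessions(toy)
print(filled)
```

With the real spreadsheet, the first two lines could presumably be skipped at read time with the `skiprows` argument of `pd.read_excel` instead of being deleted by hand.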

In [1]:
import pandas as pd
from Bio import Phylo

In [2]:
original = pd.read_csv('paper_metadata.csv', header = 0)

In [3]:
reads = pd.read_csv('reads_table.csv', header = 0)
reads = reads[['sample', 'run']]

Join the metadata with the read accessions to get a sample name for each run, and take the strain name from the `Sequencing ID` column

In [4]:
joined = pd.merge(original, reads, left_on = 'ENA accession', right_on = 'run', how = 'left')
joined = joined.drop(columns = ['run'])
joined = joined.rename(columns = {'Sequencing ID': 'strain', 'Isolation source': 'source', 'Sub-clade' : 'VNI_subdivision', 'Country of origin': 'Country_of_origin', 'ENA accession': 'SRA_Accession'})
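
Because the merge is a left join, metadata rows whose accession has no entry in the reads table keep a missing `sample`. One way to list such rows is the `indicator` option of `pd.merge`. The following is a minimal sketch on made-up data; the toy accessions and the names `checked` and `unmatched` are assumptions and not part of this notebook.

```python
import pandas as pd

# Toy stand-ins for the metadata and reads tables (values are made up)
metadata = pd.DataFrame({
    'Sequencing ID': ['s1', 's2', 's3'],
    'ENA accession': ['ERR1', 'ERR2', 'ERR3'],
})
reads = pd.DataFrame({'sample': ['ERS1', 'ERS2'], 'run': ['ERR1', 'ERR2']})

# indicator=True adds a '_merge' column: 'both' for matched rows,
# 'left_only' for metadata rows without a matching run
checked = pd.merge(metadata, reads, left_on='ENA accession', right_on='run',
                   how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'ENA accession'].tolist()
print(unmatched)
```

In the real data, a run that could not be downloaded would be expected to show up in such a list.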

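Assign the lineage from the column `Species ID from mash anlaysis` (the column name is spelled this way in the metadata). Samples in sub-clade VNII get the lineage `VNII` and no `VNI_subdivision`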

In [5]:
joined.loc[joined['Species ID from mash anlaysis'] == 'Cryptococcus neoformans var. grubii H99', 'lineage'] = 'VNI'
joined.loc[joined['Species ID from mash anlaysis'] == 'Cryptococcus neoformans var. grubii H99/Cryptococcus neoformans var. neoformans JEC21 hybrid', 'lineage'] = 'AD_hybrid'
joined.loc[joined['Species ID from mash anlaysis'] == 'Cryptococcus gattii WM276', 'lineage'] = 'gattii'
joined.loc[joined['VNI_subdivision'] == 'VNII', 'lineage'] = 'VNII'
joined.loc[joined['VNI_subdivision'] == 'VNII', 'VNI_subdivision'] = None
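
The same assignment can also be written as a dictionary lookup followed by the VNII override. This is a sketch of an alternative formulation on a made-up table, not a replacement for the cell above; the names `SPECIES_COL`, `species_to_lineage` and `toy` are assumptions.

```python
import pandas as pd

SPECIES_COL = 'Species ID from mash anlaysis'  # spelled this way in the metadata

# Species labels and lineage names as used in the cell above
species_to_lineage = {
    'Cryptococcus neoformans var. grubii H99': 'VNI',
    'Cryptococcus neoformans var. grubii H99/Cryptococcus neoformans var. neoformans JEC21 hybrid': 'AD_hybrid',
    'Cryptococcus gattii WM276': 'gattii',
}

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    SPECIES_COL: ['Cryptococcus neoformans var. grubii H99',
                  'Cryptococcus neoformans var. grubii H99',
                  'Cryptococcus gattii WM276'],
    'VNI_subdivision': ['VNIa-32', 'VNII', None],
})

# Look up the lineage per species; labels missing from the dict become NaN
toy['lineage'] = toy[SPECIES_COL].map(species_to_lineage)

# VNII is recorded as a sub-clade, so move it to the lineage column
is_vnii = toy['VNI_subdivision'] == 'VNII'
toy.loc[is_vnii, 'lineage'] = 'VNII'
toy.loc[is_vnii, 'VNI_subdivision'] = None
print(toy[['lineage', 'VNI_subdivision']])
```

A lookup of this kind leaves unexpected species labels as missing values, which makes them easy to spot with `isna()`.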

Put the columns `sample`, `strain`, `lineage`, `source` and `VNI_subdivision` to the left of the rest and sort by `VNI_subdivision`

In [6]:
first_cols = ['sample', 'strain', 'lineage', 'source', 'VNI_subdivision']
joined = joined[first_cols + [col for col in joined.columns if col not in first_cols]]
joined = joined.sort_values('VNI_subdivision')

Print the number of samples per study and lineage

In [7]:
for col in joined.select_dtypes(include='object').columns:
    joined[col] = joined[col].astype('category')
joined.groupby(['Study', 'lineage'], observed=True).size().reset_index(name='counts')


Unnamed: 0,Study,lineage,counts
0,Ashton,AD_hybrid,5
1,Ashton,VNI,678
2,Ashton,VNII,4
3,Ashton,gattii,12
4,Desjardins,VNI,185


Write a file with the metadata of all samples in the Ashton paper (the Ashton samples and the Desjardins VNI samples)

In [8]:
joined.to_csv('metadata_all_ashton_and_vni_desj.csv', index = False)

Filter out the Desjardins samples

In [9]:
ashton = joined[joined['Study'] != 'Desjardins']

Filter out the run that could not be downloaded

In [10]:
ashton = ashton[ashton['SRA_Accession'] != 'ERR2624135']

In [11]:
ashton

Unnamed: 0,sample,strain,lineage,source,VNI_subdivision,Lab ID,Species ID from mash anlaysis,Study,HIV status,SRA_Accession,Country_of_origin,Continent of Origin,Year of Origin,Mean depth of mapping with MQ > 30 across whole genome,Proportion of genome covered by at least 5 reads which mapped with MQ > 30
342,ERS542448,14936_1#57,VNI,Clinical,VNIa-32,BK167,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR842672,Vietnam,Asia,2006.0,31.391425,0.959204
245,ERS2540945,04CN-65-072,VNI,Clinical,VNIa-32,04CN-65-072,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR2624432,Uganda,Africa,2013.0,54.041513,0.990784
275,ERS2541049,04CN-65-161,VNI,Clinical,VNIa-32,04CN-65-161,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR2624145,Uganda,Africa,,45.413277,0.991486
150,ERS2541305,04CN-63-036,VNI,Clinical,VNIa-32,04CN-63-036,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR2624319,Malawi,Africa,2014.0,49.200974,0.987908
547,ERS1142749,20427_2#52,VNI,Clinical,VNIa-32,BK92,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR1671650,Vietnam,Asia,2005.0,38.213466,0.960260
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555,ERS2541337,BMD_1385,gattii,Clinical,,BMD_1385,Cryptococcus gattii WM276,Ashton,negative,ERR2624309,Vietnam,Asia,2007.0,27.804298,0.608818
556,ERS2541338,BMD_1516,gattii,Clinical,,BMD_1516,Cryptococcus gattii WM276,Ashton,negative,ERR2624397,Vietnam,Asia,2008.0,25.317098,0.601971
557,ERS2541339,BMD_2014,gattii,Clinical,,BMD_2014,Cryptococcus gattii WM276,Ashton,negative,ERR2624500,Vietnam,Asia,,20.731607,0.587846
581,ERS2541082,BMD1964,gattii,Clinical,,BMD1964,Cryptococcus gattii WM276,Ashton,negative,ERR2624244,Vietnam,Asia,2009.0,35.464555,0.601434


In [12]:
ashton.to_csv('metadata.csv', index = False, header = True)

Write the metadata of the non-VNI samples to a file, and their sample names to `exclude.txt`, to exclude them from the FungalPop analysis

In [None]:
exclude = ashton[ashton['lineage'] != 'VNI']
exclude = exclude.sort_values('lineage')
exclude.to_csv('metadata_ashton_non_vni.csv', index = False, header = True)
exclude['sample'].to_csv('exclude.txt', index = False, header = False)

## Compare names in the Ashton phylogeny to metadata

In [16]:
tree = Phylo.read('/FastData/czirion/Crypto_Diversity_Pipeline/analyses/data/raw/2017.06.09.all_ours_and_desj.snp_sites.mod.fa.cln.tree', 'newick')
tips = [tip.name for tip in tree.get_terminals()]
print(len(tips))
tips_not_in_metadata = [tip for tip in tips if tip not in joined['strain'].values]
tips_not_in_metadata

865


['GCF_000149245', '04CN-63-018']

In [17]:
strains_not_in_tree = joined[~joined['strain'].isin(tips)].reset_index(drop = True)
strains_not_in_tree

Unnamed: 0,sample,strain,lineage,source,VNI_subdivision,Lab ID,Species ID from mash anlaysis,Study,HIV status,SRA_Accession,Country_of_origin,Continent of Origin,Year of Origin,Mean depth of mapping with MQ > 30 across whole genome,Proportion of genome covered by at least 5 reads which mapped with MQ > 30
0,ERS2541256,04CN-03-031,AD_hybrid,Clinical,,04CN-03-031,Cryptococcus neoformans var. grubii H99/Crypto...,Ashton,positive,ERR2624093,Vietnam,Asia,2013.0,45.693733,0.97547
1,ERS2541170,04CN-03-074,VNII,Clinical,,04CN-03-074,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR2624467,Vietnam,Asia,2014.0,43.472265,0.965882
2,ERS2541130,04CN-03-088,VNII,Clinical,,04CN-03-088,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR2624180,Vietnam,Asia,2014.0,45.057615,0.96688
3,ERS2540936,04CN-63-006,gattii,Clinical,,04CN-63-006,Cryptococcus gattii WM276,Ashton,positive,ERR2624156,Uganda,Africa,2014.0,30.811483,0.611245
4,ERS2540975,04CN-63-020,gattii,Clinical,,04CN-63-020,Cryptococcus gattii WM276,Ashton,positive,ERR2624413,Uganda,Africa,2014.0,31.976266,0.604924
5,ERS2541070,04CN-63-021,gattii,Clinical,,04CN-63-021,Cryptococcus gattii WM276,Ashton,positive,ERR2624263,Uganda,Africa,2014.0,30.489235,0.602266
6,ERS2541050,04CN-64-092,AD_hybrid,Clinical,,04CN-64-092,Cryptococcus neoformans var. grubii H99/Crypto...,Ashton,positive,ERR2624285,Uganda,Africa,2014.0,32.871236,0.969852
7,ERS2541321,04CN-65-019,AD_hybrid,Clinical,,04CN-65-019,Cryptococcus neoformans var. grubii H99/Crypto...,Ashton,positive,ERR2624434,Uganda,Africa,2013.0,78.893862,0.978027
8,ERS2541042,04CN-65-031,VNII,Clinical,,04CN-65-031,Cryptococcus neoformans var. grubii H99,Ashton,positive,ERR2624151,Uganda,Africa,2013.0,31.248864,0.963734
9,ERS2541325,04CN-65-051,gattii,Clinical,,04CN-65-051,Cryptococcus gattii WM276,Ashton,positive,ERR2624462,Uganda,Africa,2013.0,44.914011,0.616338
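
Both directions of the comparison above amount to set differences between the tip names and the strain names. The following is a minimal sketch with made-up names; the helper `compare_names` and its inputs are hypothetical and stand in for `tips` and `joined['strain']`.

```python
def compare_names(tips, strains):
    """Return (names only in the tree, names only in the metadata), both sorted."""
    tip_set, strain_set = set(tips), set(strains)
    return sorted(tip_set - strain_set), sorted(strain_set - tip_set)


# Made-up names for illustration only
toy_tips = ['strain_a', 'strain_b', 'reference_genome']
toy_strains = ['strain_a', 'strain_b', 'strain_c']

tips_only, strains_only = compare_names(toy_tips, toy_strains)
print(tips_only)     # names present in the tree but not in the metadata
print(strains_only)  # names present in the metadata but not in the tree
```

Using sets also avoids scanning the whole strain column once per tip, which the list comprehension with `in` does.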
