The `Ashton_Supplementary_Data_1.csv` file was downloaded from [here](https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-019-10092-5/MediaObjects/41467_2019_10092_MOESM5_ESM.xlsx).  
It was converted to CSV, the first two lines were removed, and the values in `Sequencing ID` were used to fill the `ENA accession` column where it was missing.  
The values in `ENA accession` for the Ashton study were used to download the sequencing data with the downloading-tools workflow.  
The `reads_table.csv` file was created by the downloading-tools workflow.  
The run ERR2624135 could not be downloaded.  
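
The fill-in step described above could be reproduced along these lines. This is a minimal sketch on invented rows, not the code that was actually used for the conversion: the column names `Sequencing ID` and `ENA accession` come from the table, while the helper name `fill_missing_accessions` and the example values are hypothetical.

```python
import pandas as pd

def fill_missing_accessions(table: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where empty 'ENA accession' cells take the 'Sequencing ID' value."""
    filled = table.copy()
    filled["ENA accession"] = filled["ENA accession"].fillna(filled["Sequencing ID"])
    return filled

# Toy rows (invented values): the second row has no accession.
toy = pd.DataFrame({
    "Sequencing ID": ["strain_A", "SRR0000002"],
    "ENA accession": ["ERR0000001", None],
})
filled_toy = fill_missing_accessions(toy)
```

Working on a copy keeps the original table untouched, so the filled cells can be compared against the raw values afterwards.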

In [None]:
import pandas as pd
import os
os.chdir("/FastData/czirion/WeavePop_Cneoformans/")

Input

In [21]:
original_metadata_path = "Crypto_Ashton/config/Ashton_Supplementary_Data_1.csv"
reads_table_path = "Crypto_Ashton/config/reads_table.csv"

Output

In [22]:
full_metadata_path = "Crypto_Ashton/config/metadata_all_ashton_and_vni_desj.csv"
vni_metadata_path = "Crypto_Ashton/config/metadata_vni_ashton_and_vni_desj.csv"
non_vni_metadata_path = "Crypto_Ashton/config/metadata_non_vni_ashton.csv"
vni_ashton_metadata_path = "Crypto_Ashton/config/metadata.csv" # The one to use in WeavePop
non_vni_list_path = "Crypto_Ashton/config/non_vni.txt"

Read original tables

In [23]:
original = pd.read_csv(original_metadata_path, header = 0)

In [24]:
reads = pd.read_csv(reads_table_path, header = 0)
reads = reads[['sample', 'run']]

Join the metadata with the read accessions so that every row gets a sample name, then rename the columns. The strain name comes from the `Sequencing ID` column.

In [25]:
joined = pd.merge(original, reads, left_on = 'ENA accession', right_on = 'run', how = 'left')
joined = joined.drop(columns = ['run'])
joined = joined.rename(columns = {'Sequencing ID': 'strain', 'Isolation source': 'source', 'Sub-clade' : 'vni_subdivision', 'Country of origin': 'country_of_origin', 'ENA accession': 'sra_accession'})
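
Because the join is a left merge, metadata rows whose accession is absent from the reads table are kept with an empty `sample`. One possible way to list such rows is the `indicator` option of `pd.merge`. The sketch below uses invented toy tables (`toy_metadata`, `toy_reads`) and is only an optional check, not part of the original workflow.

```python
import pandas as pd

# Toy stand-ins for the metadata and reads tables (invented values).
toy_metadata = pd.DataFrame({"ENA accession": ["ERR1", "ERR2", "ERR3"]})
toy_reads = pd.DataFrame({"sample": ["S1", "S2"], "run": ["ERR1", "ERR2"]})

# indicator=True adds a '_merge' column stating where each row was found.
checked = pd.merge(
    toy_metadata, toy_reads,
    left_on="ENA accession", right_on="run",
    how="left", indicator=True,
)

# Rows marked 'left_only' have metadata but no matching run in the reads table.
unmatched = checked.loc[checked["_merge"] == "left_only", "ENA accession"].tolist()
```

In this toy case `unmatched` contains only the accession that has no entry in `toy_reads`.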

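Assign lineage from the column `Species ID from mash anlaysis` (the column name is spelled this way in the table). Samples whose sub-clade is VNII are assigned to lineage VNII and their `vni_subdivision` is cleared.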

In [26]:
joined.loc[joined['Species ID from mash anlaysis'] == 'Cryptococcus neoformans var. grubii H99', 'lineage'] = 'VNI'
joined.loc[joined['Species ID from mash anlaysis'] == 'Cryptococcus neoformans var. grubii H99/Cryptococcus neoformans var. neoformans JEC21 hybrid', 'lineage'] = 'AD_hybrid'
joined.loc[joined['Species ID from mash anlaysis'] == 'Cryptococcus gattii WM276', 'lineage'] = 'gattii'
joined.loc[joined['vni_subdivision'] == 'VNII', 'lineage'] = 'VNII'
joined.loc[joined['vni_subdivision'] == 'VNII', 'vni_subdivision'] = None
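
The same assignment can also be written with a lookup dictionary and `Series.map`, which may be easier to extend if more species labels appear. This is an alternative sketch on invented toy rows, not the cell used above; the name `SPECIES_TO_LINEAGE` is hypothetical and the sub-clade values are illustrative.

```python
import pandas as pd

# Species label -> lineage, mirroring the three assignments in the cell above.
SPECIES_TO_LINEAGE = {
    "Cryptococcus neoformans var. grubii H99": "VNI",
    "Cryptococcus neoformans var. grubii H99/Cryptococcus neoformans var. neoformans JEC21 hybrid": "AD_hybrid",
    "Cryptococcus gattii WM276": "gattii",
}

# Toy rows (invented): one VNI, one VNII labelled through the sub-clade, one gattii.
toy = pd.DataFrame({
    "Species ID from mash anlaysis": [
        "Cryptococcus neoformans var. grubii H99",
        "Cryptococcus neoformans var. grubii H99",
        "Cryptococcus gattii WM276",
    ],
    "vni_subdivision": ["VNIa-4", "VNII", None],
})

# Labels missing from the dictionary would become NaN instead of raising an error.
toy["lineage"] = toy["Species ID from mash anlaysis"].map(SPECIES_TO_LINEAGE)

# The sub-clade overrides the species-based lineage for VNII samples.
is_vnii = toy["vni_subdivision"] == "VNII"
toy.loc[is_vnii, "lineage"] = "VNII"
toy.loc[is_vnii, "vni_subdivision"] = None
```

Computing the `is_vnii` mask once before either assignment avoids depending on the order of the two `loc` statements.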

Convert all column names to lowercase and replace spaces with underscores

In [27]:
joined.columns = joined.columns.str.lower().str.replace(' ', '_', regex=True)

Create column for VNIa subdivision

In [28]:
joined.loc[joined["vni_subdivision"].str.contains("VNIa", na=False), "vnia_subdivision"] = joined["vni_subdivision"]
joined.loc[:, "vni_subdivision" ] = joined["vni_subdivision"].str.split("-").str[0]
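
To see what these two statements do, here is the same logic applied to a small invented table. The sub-clade values are illustrative only; the sketch assumes, as the code above does, that VNIa sub-clades are written as `VNIa-<number>`.

```python
import pandas as pd

# Toy sub-clade values (illustrative), including a missing one.
toy = pd.DataFrame({"vni_subdivision": ["VNIa-4", "VNIa-93", "VNIb", None]})

# Keep the full sub-clade label only for VNIa rows; other rows stay empty.
is_vnia = toy["vni_subdivision"].str.contains("VNIa", na=False)
toy.loc[is_vnia, "vnia_subdivision"] = toy["vni_subdivision"]

# Reduce vni_subdivision to the part before the first hyphen.
toy["vni_subdivision"] = toy["vni_subdivision"].str.split("-").str[0]
```

After this, `vni_subdivision` holds the coarse label (`VNIa` or `VNIb`) and `vnia_subdivision` holds the detailed label for VNIa rows only.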

Move the columns `sample`, `strain`, `lineage`, `source`, `vni_subdivision` and `vnia_subdivision` to the left of the rest and sort by `vni_subdivision`

In [29]:
leading_columns = ['sample', 'strain', 'lineage', 'source', 'vni_subdivision', 'vnia_subdivision']
joined = joined[leading_columns + [col for col in joined.columns if col not in leading_columns]]
joined = joined.sort_values('vni_subdivision')

## Save multiple metadata tables with different subsets

All samples in the Ashton paper (includes the Desjardins VNI samples)

In [30]:
joined.to_csv(full_metadata_path, index = False)

In [31]:
joined.groupby(['study', 'lineage'], observed=True).size().reset_index(name='counts')


Unnamed: 0,study,lineage,counts
0,Ashton,AD_hybrid,5
1,Ashton,VNI,678
2,Ashton,VNII,4
3,Ashton,gattii,12
4,Desjardins,VNI,185


All VNI samples in the Ashton paper except ERR2624135, which could not be downloaded. Includes the Desjardins VNI samples.

In [32]:
VNI = joined[joined['lineage'] == 'VNI']
VNI = VNI[VNI['sra_accession'] != "ERR2624135"]
VNI.to_csv(vni_metadata_path, index = False)

In [33]:
VNI.groupby(['study', 'lineage'], observed=True).size().reset_index(name='counts')


Unnamed: 0,study,lineage,counts
0,Ashton,VNI,677
1,Desjardins,VNI,185


VNI Ashton samples (without Desjardins VNI and without ERR2624135)

In [34]:
VNI_ashton = VNI[VNI['study'] == 'Ashton']
VNI_ashton.to_csv(vni_ashton_metadata_path, index = False)

In [35]:
VNI_ashton.groupby(['vni_subdivision'], observed=True).size().reset_index(name='counts')


Unnamed: 0,vni_subdivision,counts
0,VNIa,667
1,VNIb,10


Non-VNI samples

In [36]:
non_VNI = joined[joined['lineage'] != 'VNI']
non_VNI = non_VNI.sort_values('lineage')
non_VNI.to_csv(non_vni_metadata_path, index = False)

In [37]:
non_VNI.groupby(['lineage'], observed=True).size().reset_index(name='counts')

Unnamed: 0,lineage,counts
0,AD_hybrid,5
1,VNII,4
2,gattii,12
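
As a quick consistency check, the counts printed in this notebook should add up: the VNI table, the non-VNI table and the single excluded run together should equal the full table. The sketch below simply redoes that arithmetic with the numbers copied from the outputs above; the variable names are hypothetical, and rows with a missing lineage, if any, would not appear in any of these tallies.

```python
# Counts copied from the output tables shown in this notebook.
full_counts = {
    ("Ashton", "AD_hybrid"): 5,
    ("Ashton", "VNI"): 678,
    ("Ashton", "VNII"): 4,
    ("Ashton", "gattii"): 12,
    ("Desjardins", "VNI"): 185,
}
vni_counts = {("Ashton", "VNI"): 677, ("Desjardins", "VNI"): 185}
non_vni_counts = {"AD_hybrid": 5, "VNII": 4, "gattii": 12}
excluded_runs = 1  # ERR2624135, removed from the VNI table

total_full = sum(full_counts.values())
total_subsets = sum(vni_counts.values()) + sum(non_vni_counts.values()) + excluded_runs

# The subsets plus the excluded run should account for every row of the full table.
subsets_match = total_full == total_subsets
```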


Write the list of non-VNI sample names to a text file

In [38]:
non_VNI['sample'].to_csv(non_vni_list_path, index = False, header = False)