## Tumor samples from Zavidij *et al.*, Nature Cancer (2020)

The following CSV file was provided by Romanos. The old samples were named differently, and the `Sample` column in the file uses the old nomenclature. Note that the numbers of the healthy bone marrow samples (`NBM1-12`) do not refer to the same patients as the identically numbered NBM samples in the new tumor cohort. In both projects, the healthy bone marrow samples were numbered in the order in which they became available.

In [109]:
import os.path
import subprocess
import numpy as np
import pandas as pd

df = pd.read_csv("/home/jtsuji/metadata/Zavidij_NatCancer2020/OksanaTumors.csv")
df

Unnamed: 0,Demux_Sample_ID,CaTissueID,Aliquot,TissueType,Fraction,Assay_Type,Sample,Patient,Diagnosis,Study,Treatment,10X_kit,Batch,Type,FilteredMatrices
0,pM5574_1_BM_CD138pos_GEX_3,pM5574,1,BM,CD138pos,GEX_3,pM5574CD138p,pM5574,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/a...
1,pM4961_1_BM_CD138pos_GEX_3,pM4961,1,BM,CD138pos,GEX_3,pM4961CD138p,pM4961,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/a...
2,pM4949_1_BM_CD138pos_GEX_3,pM4949,1,BM,CD138pos,GEX_3,pM4949CD138p,pM4949,NDMM,14-174,Non-trial,3_v2,B1,Zavidij et al. NDMM,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...
3,pM4782_1_BM_CD138pos_GEX_3,pM4782,1,BM,CD138pos,GEX_3,pM4782CD138p,pM4782,LRSMM,14-174,Non-trial,3_v2,B1,Zavidij et al. LRSMM,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...
4,pM4739_1_BM_CD138pos_GEX_3,pM4739,1,BM,CD138pos,GEX_3,pM4739CD138p,pM4739,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...
5,pM4690_1_BM_CD138pos_GEX_3,pM4690,1,BM,CD138pos,GEX_3,pM4690CD138p,pM4690,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/4...
6,pM4599_1_BM_CD138pos_GEX_3,pM4599,1,BM,CD138pos,GEX_3,pM4599CD138p,pM4599,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...
7,pM4577_1_BM_CD138pos_GEX_3,pM4577,1,BM,CD138pos,GEX_3,pM4577CD138p,pM4577,NDMM,14-174,Non-trial,3_v2,B1,Zavidij et al. NDMM,gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/4...
8,pM4474_1_BM_CD138pos_GEX_3,pM4474,1,BM,CD138pos,GEX_3,pM4474CD138p,pM4474,LRSMM,14-174,Non-trial,3_v2,B1,Zavidij et al. LRSMM,gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/4...
9,pM4384_1_BM_CD138pos_GEX_3,pM4384,1,BM,CD138pos,GEX_3,pM4384CD138p,pM4384,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/a...


There are 38 samples from the study, and all the GEX libraries are 3' based.

In [10]:
df['Diagnosis'].value_counts()

NBM      12
NDMM      9
HRSMM     8
MGUS      6
LRSMM     3
Name: Diagnosis, dtype: int64
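As a quick sanity check, the per-diagnosis counts can be summed to confirm the total of 38 samples. This is a minimal sketch with the counts copied by hand from the output above; the variable names are illustrative only.

```python
# Counts copied from the value_counts() output above
diagnosis_counts = {'NBM': 12, 'NDMM': 9, 'HRSMM': 8, 'MGUS': 6, 'LRSMM': 3}

# Total number of samples across all diagnoses
total = sum(diagnosis_counts.values())

# Tumor (non-NBM) samples only
tumor_total = total - diagnosis_counts['NBM']
print(total, tumor_total)
```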

In [46]:
# Extract the bucket root (gs://<bucket>) from each FilteredMatrices path
fastq_dir = df['FilteredMatrices'].apply(lambda x: '/'.join(x.split('/')[:3]))
fastq_dir.unique()

array(['gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb',
       'gs://fc-c4d49882-335b-414a-86e2-949e34e940df',
       'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab',
       'gs://fc-c1ac8c88-9373-4b4f-8da2-682aa7b4f969',
       'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2'], dtype=object)
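The split used above works because a `gs://` URI has an empty component between the two slashes, so the first three `/`-separated pieces are the scheme, the empty string, and the bucket name. A minimal sketch of the same idea as a standalone function; the name `bucket_root` is hypothetical and not part of the notebook.

```python
def bucket_root(gs_uri):
    """Return 'gs://<bucket>' for a Google Cloud Storage URI."""
    if not gs_uri.startswith('gs://'):
        raise ValueError('not a gs:// URI: ' + gs_uri)
    # 'gs://bucket/a/b'.split('/') -> ['gs:', '', 'bucket', 'a', 'b']
    return '/'.join(gs_uri.split('/')[:3])

# One of the FASTQ directories listed in this notebook
example = 'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/'
root = bucket_root(example)
print(root)
```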

In [115]:
# Known FASTQ directories per bucket (kept for reference only); the loop below
# globs each whole bucket (the dict keys) and collects every *fastq.gz file in it
fastq_gs_path = {
 'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb': ['gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/'],
 'gs://fc-c4d49882-335b-414a-86e2-949e34e940df': ['gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/'],
 'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab': ['gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/'],
 'gs://fc-c1ac8c88-9373-4b4f-8da2-682aa7b4f969': ['gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/',
                                                  'gs://fc-c1ac8c88-9373-4b4f-8da2-682aa7b4f969/samples/'],
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2': ['gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/']
}

fastq_paths = set()
for fq in fastq_gs_path.keys():
    outs = subprocess.check_output("gsutil ls {}/**fastq.gz".format(fq), shell=True)
    outs = outs.decode('utf-8').rstrip().split('\n')
    fastq_paths = fastq_paths.union(set(outs))

fastq_path_dict = dict()
for path in fastq_paths:
    sample = os.path.dirname(path).split('/')[-1]
    fastq_path_dict.setdefault(sample, {'R1':set(), 'R2':set(), 'I1':set()})
    if len(os.path.basename(path).split('_R1_')) > 1:
        fastq_path_dict[sample]['R1'].add(path)
    elif len(os.path.basename(path).split('_R2_')) > 1:
        fastq_path_dict[sample]['R2'].add(path)
    elif len(os.path.basename(path).split('_I1_')) > 1:
        fastq_path_dict[sample]['I1'].add(path)
    else:
        print("weird path!: " + path)

** behavior is undefined if directly preceeded or followed by with characters other than / in the cloud and / locally.
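The read type of each file is inferred from its basename, and the sample from its parent directory. A minimal sketch of the same grouping with a regular expression instead of `split`, assuming Illumina-style names ending in `_R1_001.fastq.gz`, `_R2_001.fastq.gz`, or `_I1_001.fastq.gz`; `READ_PATTERN`, `read_type`, and `group_by_sample` are hypothetical names.

```python
import os.path
import re

# Assumed naming: <prefix>_<read>_<chunk>.fastq.gz, e.g. ..._R1_001.fastq.gz
READ_PATTERN = re.compile(r'_(R1|R2|I1)_\d+\.fastq\.gz$')

def read_type(path):
    """Return 'R1', 'R2' or 'I1' for a FASTQ path, or None if it does not match."""
    match = READ_PATTERN.search(os.path.basename(path))
    return match.group(1) if match else None

def group_by_sample(paths):
    """Group FASTQ paths by parent directory name (the sample) and read type."""
    grouped = {}
    for path in paths:
        kind = read_type(path)
        if kind is None:
            continue  # unrecognised file name; skipped in this sketch
        sample = os.path.basename(os.path.dirname(path))
        reads = grouped.setdefault(sample, {'R1': set(), 'R2': set(), 'I1': set()})
        reads[kind].add(path)
    return grouped

# Paths taken from this notebook, plus one deliberately malformed entry
paths = [
    'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/pM4961CD138P_S6_L002_R1_001.fastq.gz',
    'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/pM4961CD138P_S6_L002_R2_001.fastq.gz',
    'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/pM4961CD138P_S6_L002_I1_001.fastq.gz',
    'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/notes.fastq.gz',
]
grouped = group_by_sample(paths)
```

Anchoring the pattern at the end of the basename avoids misclassifying a file whose sample name happens to contain `_R1_`.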

In [118]:
# Object-dtype columns, so that each cell can hold a set of paths
for col in ['FASTQ_R1', 'FASTQ_R2', 'FASTQ_I1']:
    df[col] = pd.Series(np.nan, index=df.index, dtype=object)

# Find the sample-specific FASTQs in the pool.
# The FASTQ directory names use 'CD138P' (upper-case P), unlike the Sample column.
for i in df.index:
    sample = df.at[i, 'Sample'].replace('CD138p', 'CD138P')
    if sample in fastq_path_dict:
        df.at[i, 'FASTQ_R1'] = fastq_path_dict[sample]['R1']
        df.at[i, 'FASTQ_R2'] = fastq_path_dict[sample]['R2']
        df.at[i, 'FASTQ_I1'] = fastq_path_dict[sample]['I1']
    else:
        raise Exception(sample + " is not found in the fastq file pool!")

df

Unnamed: 0,Demux_Sample_ID,CaTissueID,Aliquot,TissueType,Fraction,Assay_Type,Sample,Patient,Diagnosis,Study,Treatment,10X_kit,Batch,Type,FilteredMatrices,FASTQ_R1,FASTQ_R2,FASTQ_I1
0,pM5574_1_BM_CD138pos_GEX_3,pM5574,1,BM,CD138pos,GEX_3,pM5574CD138p,pM5574,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/a...,{gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/...,{gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/...,{gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/...
1,pM4961_1_BM_CD138pos_GEX_3,pM4961,1,BM,CD138pos,GEX_3,pM4961CD138p,pM4961,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/a...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...
2,pM4949_1_BM_CD138pos_GEX_3,pM4949,1,BM,CD138pos,GEX_3,pM4949CD138p,pM4949,NDMM,14-174,Non-trial,3_v2,B1,Zavidij et al. NDMM,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...
3,pM4782_1_BM_CD138pos_GEX_3,pM4782,1,BM,CD138pos,GEX_3,pM4782CD138p,pM4782,LRSMM,14-174,Non-trial,3_v2,B1,Zavidij et al. LRSMM,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...
4,pM4739_1_BM_CD138pos_GEX_3,pM4739,1,BM,CD138pos,GEX_3,pM4739CD138p,pM4739,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...
5,pM4690_1_BM_CD138pos_GEX_3,pM4690,1,BM,CD138pos,GEX_3,pM4690CD138p,pM4690,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/4...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...
6,pM4599_1_BM_CD138pos_GEX_3,pM4599,1,BM,CD138pos,GEX_3,pM4599CD138p,pM4599,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-c4d49882-335b-414a-86e2-949e34e940df/8...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...,{gs://fc-c4d49882-335b-414a-86e2-949e34e940df/...
7,pM4577_1_BM_CD138pos_GEX_3,pM4577,1,BM,CD138pos,GEX_3,pM4577CD138p,pM4577,NDMM,14-174,Non-trial,3_v2,B1,Zavidij et al. NDMM,gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/4...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...
8,pM4474_1_BM_CD138pos_GEX_3,pM4474,1,BM,CD138pos,GEX_3,pM4474CD138p,pM4474,LRSMM,14-174,Non-trial,3_v2,B1,Zavidij et al. LRSMM,gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/4...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...,{gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/...
9,pM4384_1_BM_CD138pos_GEX_3,pM4384,1,BM,CD138pos,GEX_3,pM4384CD138p,pM4384,MGUS,14-174,Non-trial,3_v2,B1,Zavidij et al. MGUS,gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/a...,{gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/...,{gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/...,{gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/...


In [121]:
# Three samples had multiple fastq paths
# pM4961CD138P
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/pM4961CD138P_S6_L002_R1_001.fastq.gz'
'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/pM4961CD138P/pM4961_138P_S38_L008_R1_001.fastq.gz'
'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/pM4961CD138P/pM4961CD138P_S18_L008_R1_001.fastq.gz'
'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/pM4961CD138P/pM4961CD138P_S18_L008_R2_001.fastq.gz'
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/pM4961CD138P_S6_L002_R2_001.fastq.gz'
'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/pM4961CD138P/pM4961_138P_S38_L008_R2_001.fastq.gz'
'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/pM4961CD138P/pM4961_138P_S38_L008_I1_001.fastq.gz'
'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/pM4961CD138P/pM4961CD138P_S18_L008_I1_001.fastq.gz'
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/pM4961CD138P_S6_L002_I1_001.fastq.gz'

# pM4384CD138P
'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/pM4384CD138P/pM4384CD138P_S16_L008_R1_001.fastq.gz'
'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/pM4384CD138P/pM4384_138P_S1_L001_R1_001.fastq.gz'
'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/pM4384CD138P/pM4384CD138P_S16_L008_R2_001.fastq.gz'
'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/pM4384CD138P/pM4384_138P_S1_L001_R2_001.fastq.gz'
'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/pM4384CD138P/pM4384CD138P_S16_L008_I1_001.fastq.gz'
'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/pM4384CD138P/pM4384_138P_S1_L001_I1_001.fastq.gz'

# NBM2CD138P
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM2CD138P/NBM2CD138P_S33_L007_R1_001.fastq.gz'
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM2CD138P/NBM2CD138P_S01_L001_R1_001.fastq.gz'
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM2CD138P/NBM2CD138P_S33_L007_R2_001.fastq.gz'
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM2CD138P/NBM2CD138P_S01_L001_R2_001.fastq.gz'
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM2CD138P/NBM2CD138P_S01_L001_I1_001.fastq.gz'
'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM2CD138P/NBM2CD138P_S33_L007_I1_001.fastq.gz'

print("... investigating FASTQ paths done!")

... investigating FASTQ paths done!
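Samples like these can also be found programmatically by counting R1 files per sample. A minimal sketch, assuming a dictionary shaped like `fastq_path_dict` above (sample name mapped to sets of R1/R2/I1 paths); the function name and the small `toy` dictionary are illustrative, with paths taken from this notebook.

```python
def samples_with_multiple_runs(path_dict):
    """Return the sorted sample names that have more than one R1 FASTQ file."""
    return sorted(s for s, reads in path_dict.items() if len(reads['R1']) > 1)

toy = {
    'pM4961CD138P': {
        'R1': {
            'gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/pM4961CD138P/pM4961CD138P_S6_L002_R1_001.fastq.gz',
            'gs://fc-b85bdd54-c175-4a79-958b-5bfaed569fab/fastq/Dec20_set/pM4961CD138P/pM4961_138P_S38_L008_R1_001.fastq.gz',
            'gs://fc-dca6742a-b561-4cd0-9f9c-5ff3f6cde2fb/fastq/May4_set/pM4961CD138P/pM4961CD138P_S18_L008_R1_001.fastq.gz',
        },
        'R2': set(),  # left empty in this toy example
        'I1': set(),
    },
    'NBM10CD138P': {
        'R1': {
            'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM10CD138P/NBM10CD138P_S7_L002_R1_001.fastq.gz',
        },
        'R2': set(),
        'I1': set(),
    },
}
multi = samples_with_multiple_runs(toy)
print(multi)
```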


### Write a shell script to download FASTQs and rename the files with Demux_Sample_ID

In [127]:
with open("/home/jtsuji/metadata/Zavidij_NatCancer2020/download.sh", "w") as fo:
    for i, row in df.iterrows():
        vm_dir = '/mnt/nfs/fastq/' + row['Demux_Sample_ID']
        fo.write("mkdir -p {}".format(vm_dir) + "\n")
        for ty in ['R1', 'R2', 'I1']:
            key = 'FASTQ_' + ty
            for original_fastq in row[key]:
                original_fastq_basename = os.path.basename(original_fastq)
                # The old sample name appears as either <ID>CD138P or <ID>_138P in file names
                suffix1 = original_fastq_basename.replace(row['Sample'].replace('CD138p', 'CD138P'), '')
                suffix2 = original_fastq_basename.replace(row['Sample'].replace('CD138p', '_138P'), '')
                if suffix1.startswith('_'):
                    fastq_name = row['Demux_Sample_ID'] + suffix1
                elif suffix2.startswith('_'):
                    fastq_name = row['Demux_Sample_ID'] + suffix2
                else:
                    # Unexpected prefix: drop it and keep the rest of the name,
                    # restoring the '_' separator that the split removes
                    fastq_name = row['Demux_Sample_ID'] + '_' + '_'.join(original_fastq_basename.split('_')[1:])
                new_fastq = vm_dir + "/" + fastq_name
                fo.write("gsutil cp {} {}".format(original_fastq, new_fastq) + "\n")
print("... shell script written!")

... shell script written!


While compiling the fastq names, `NBM1.3_1_BM_CD138pos_GEX_3` turned out to have FASTQ files whose names start with `FrozNBMCD138P` rather than with the sample name:

```
gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM1CD138P/FrozNBMCD138P_S5_L005_R1_001.fastq.gz
gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM1CD138P/FrozNBMCD138P_S5_L005_R2_001.fastq.gz
gs://fc-c4d49882-335b-414a-86e2-949e34e940df/fastq/Nov9_set/NBM1CD138P/FrozNBMCD138P_S5_L005_I1_001.fastq.gz
```
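The renaming rule therefore has three cases: the old sample name as a prefix, the `_138P` variant of it, and an unrelated prefix such as `FrozNBMCD138P`. A minimal sketch of that rule as a standalone function, using prefix matching rather than the `replace` calls in the notebook; the function name `renamed_fastq` is hypothetical, and the old sample name `NBM1CD138p` for `NBM1.3_1_BM_CD138pos_GEX_3` is an assumption inferred from the directory name above.

```python
def renamed_fastq(basename, old_sample, demux_id):
    """Replace the sample prefix of a FASTQ basename with demux_id."""
    # Two spellings of the old sample name seen in file names
    prefixes = (
        old_sample.replace('CD138p', 'CD138P'),
        old_sample.replace('CD138p', '_138P'),
    )
    for prefix in prefixes:
        if basename.startswith(prefix + '_'):
            return demux_id + basename[len(prefix):]
    # Unknown prefix: drop everything up to the first underscore
    return demux_id + '_' + basename.split('_', 1)[1]

a = renamed_fastq('pM4961CD138P_S6_L002_R1_001.fastq.gz',
                  'pM4961CD138p', 'pM4961_1_BM_CD138pos_GEX_3')
b = renamed_fastq('pM4961_138P_S38_L008_R1_001.fastq.gz',
                  'pM4961CD138p', 'pM4961_1_BM_CD138pos_GEX_3')
# 'NBM1CD138p' is an assumed value of the Sample column for this row
c = renamed_fastq('FrozNBMCD138P_S5_L005_R1_001.fastq.gz',
                  'NBM1CD138p', 'NBM1.3_1_BM_CD138pos_GEX_3')
print(a, b, c, sep='\n')
```

The fallback branch assumes the unexpected prefix contains no underscore, which holds for `FrozNBMCD138P` but would need revisiting for other names.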

In [128]:
fastq_paths

{'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM10CD138N/NBM10CD138N_S8_L002_I1_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM10CD138N/NBM10CD138N_S8_L002_R1_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM10CD138N/NBM10CD138N_S8_L002_R2_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM10CD138P/NBM10CD138P_S7_L002_I1_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM10CD138P/NBM10CD138P_S7_L002_R1_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM10CD138P/NBM10CD138P_S7_L002_R2_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM11CD138N/NBM11CD138N_S2_L001_I1_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM11CD138N/NBM11CD138N_S2_L001_R1_001.fastq.gz',
 'gs://fc-265e6d00-fac3-47a6-b130-23beaa39f8f2/fastq/Feb8_set/NBM11CD138N/NBM11CD138N_S2_L001_R2_001.fas