# Upload sequencing data to SRA
This Python Jupyter notebook uploads the sequencing data to the NIH [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra), or SRA.

## Create BioProject and BioSamples
The first step, creating the BioProject and BioSamples, was done manually.
Note that for future uploads related to the RBD DMS, you may be able to use the existing BioProject, but since these are the first entries in this project I needed to create a new BioProject.

To create these, I went to the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) and signed in using the box at the upper right of the webpage, and then went to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/sra/).
I then manually completed the first five steps, which define the project and samples.

## Create submission sheet
The sixth step is to create the submission sheet in `*.tsv` format, which is done by the following code.

First, import Python modules:

In [1]:
import ftplib
import os
import tarfile
import datetime

import natsort

import pandas as pd

import yaml

Read the configuration for the analysis:

In [2]:
with open('../config.yaml') as f:
    config = yaml.safe_load(f)

Read the PacBio runs:

In [3]:
pacbio_runs_file = os.path.join('../', config['pacbio_runs'])

print(f"Reading PacBio runs from {pacbio_runs_file}")

pacbio_runs = (
    pd.read_csv(pacbio_runs_file)
    .assign(ccs_file=lambda x: f"../{config['ccs_dir']}/" + x['library'] + '_' + x['run'] + '_ccs.fastq.gz')
    )

pacbio_runs.head()

Reading PacBio runs from ../data/PacBio_runs.csv


Unnamed: 0,library,run,subreads,ccs_file
0,lib1,200415_A,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib1_200415_A_ccs.fastq.gz
1,lib1,200415_B,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib1_200415_B_ccs.fastq.gz
2,lib2,200415_A,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib2_200415_A_ccs.fastq.gz
3,lib2,200415_B,/fh/fast/bloom_j/SR/ngs/pacbio/200415_TylerSta...,../results/ccs/lib2_200415_B_ccs.fastq.gz


Next make submission entries for the PacBio CCSs:

In [4]:
pacbio_submissions = (
    pacbio_runs
    .assign(
        sample_name='PacBio_CCSs',  # BioSample created in SRA wizard
        library_ID=lambda x: x['library'] + '_PacBio_CCSs',  # unique library ID
        title='PacBio CCSs linking variants to barcodes for SARS-CoV-2 RBD deep mutational scanning',
        library_strategy='Synthetic-Long-Read',
        library_source='SYNTHETIC',
        library_selection='Restriction Digest',
        library_layout='single',
        platform='PACBIO_SMRT',
        instrument_model='PacBio Sequel',
        design_description='Restriction digest of plasmids carrying barcoded RBD variants',
        filetype='fastq',
        filename_fullpath=lambda x: x['ccs_file'],      
        )
    .drop(columns=pacbio_runs.columns)
    )

pacbio_submissions.head()

Unnamed: 0,sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath
0,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_A_ccs.fastq.gz
1,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_B_ccs.fastq.gz
2,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_A_ccs.fastq.gz
3,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_B_ccs.fastq.gz


Read the Illumina runs:

In [5]:
illumina_runs_file = os.path.join('../', config['barcode_runs'])

print(f"Reading Illumina runs from {illumina_runs_file}")

illumina_runs = pd.read_csv(illumina_runs_file)

illumina_runs.head()

Reading Illumina runs from ../data/barcode_runs.csv


Unnamed: 0,library,sample,sample_type,sort_bin,concentration,date,number_cells,R1
0,lib1,SortSeq_bin1,SortSeq,1,,200416,6600000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
1,lib1,SortSeq_bin2,SortSeq,2,,200416,3060000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
2,lib1,SortSeq_bin3,SortSeq,3,,200416,2511000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
3,lib1,SortSeq_bin4,SortSeq,4,,200416,2992000,/shared/ngs/illumina/tstarr/200427_D00300_0952...
4,lib2,SortSeq_bin1,SortSeq,1,,200416,6420000,/shared/ngs/illumina/tstarr/200427_D00300_0953...


Next make submission entries for Illumina data:

In [6]:
illumina_submissions = (
    illumina_runs
    .assign(
        sample_name=lambda x: x['sample_type'].map({'SortSeq': 'expression_barcodes',
                                                    'TiteSeq': 'hACE2_binding_barcodes'}),
        library_ID=lambda x: x['library'] + '_' + x['sample'],
        title=lambda x: 'SARS-CoV-2 RBD deep mutational scanning Illumina barcode sequencing for ' + x['sample'],
        library_strategy='AMPLICON',
        library_source='SYNTHETIC',
        library_selection='PCR',
        library_layout='single',
        platform='ILLUMINA',
        instrument_model='Illumina HiSeq 2500',
        design_description='PCR of barcodes from RBD variants',
        filetype='fastq',
        filename_fullpath=lambda x: x['R1'].str.split(';'),       
        )
    .explode('filename_fullpath')
    .assign(filename_fullpath=lambda x: x['filename_fullpath'].str.strip())
    .drop(columns=illumina_runs.columns)
    .reset_index(drop=True)
    )

illumina_submissions.head()

Unnamed: 0,sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath
0,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_0952...
1,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
2,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
3,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
4,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/fh/fast/bloom_j/SR/ngs/illumina/tstarr/200427...
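
As a minimal illustration of the split / explode / strip pattern used above, the sketch below applies it to a toy table. The `toy_runs` data frame and its paths are made up for illustration; the only thing taken from the code above is the assumption that the `R1` column holds semicolon-separated paths.

```python
import pandas as pd

# Toy stand-in for the barcode runs table; the paths are made up.
toy_runs = pd.DataFrame({
    'library': ['lib1', 'lib2'],
    'sample': ['SortSeq_bin1', 'SortSeq_bin1'],
    'R1': ['/data/a_L001.fastq.gz; /data/a_L002.fastq.gz', '/data/b_L001.fastq.gz'],
})

toy_tidy = (
    toy_runs
    .assign(filename_fullpath=lambda x: x['R1'].str.split(';'))  # string -> list of paths
    .explode('filename_fullpath')  # one row per list element
    .assign(filename_fullpath=lambda x: x['filename_fullpath'].str.strip())  # drop stray spaces
    .drop(columns='R1')
    .reset_index(drop=True)
)

print(toy_tidy)
```

The first toy run lists two files, so it becomes two rows that share the same `library` and `sample` values.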


Now concatenate the PacBio and Illumina submissions into tidy format (one line per file), make sure all the files exist, and also make short name versions of them that lack the path:

In [7]:
submissions_tidy = (
    pd.concat([pacbio_submissions, illumina_submissions], ignore_index=True)
    .assign(file_exists=lambda x: x['filename_fullpath'].map(os.path.isfile),
            filename=lambda x: x['filename_fullpath'].map(os.path.basename),
            )
    )

assert submissions_tidy['file_exists'].all(), submissions_tidy.query('file_exists == False')

submissions_tidy.head()

Unnamed: 0,sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename_fullpath,file_exists,filename
0,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_A_ccs.fastq.gz,True,lib1_200415_A_ccs.fastq.gz
1,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib1_200415_B_ccs.fastq.gz,True,lib1_200415_B_ccs.fastq.gz
2,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_A_ccs.fastq.gz,True,lib2_200415_A_ccs.fastq.gz
3,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,fastq,../results/ccs/lib2_200415_B_ccs.fastq.gz,True,lib2_200415_B_ccs.fastq.gz
4,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,fastq,/shared/ngs/illumina/tstarr/200427_D00300_0952...,True,200416_lib1_FITCbin1_TGGAACAA_L001_R1_001.fast...
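
Because the submission sheet and the archive built later identify files only by `filename`, two files in different directories with the same basename would collide. The notebook does not say whether this was checked, so the following is just a sketch of one possible check; `duplicated_basenames` and the example paths are hypothetical.

```python
import collections
import os


def duplicated_basenames(paths):
    """Return the sorted basenames shared by more than one path in `paths`."""
    counts = collections.Counter(os.path.basename(p) for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up paths: two different directories holding a file with the same name.
example_paths = [
    '/runA/sample_L001_R1_001.fastq.gz',
    '/runB/sample_L001_R1_001.fastq.gz',
    '/runB/sample_L002_R1_001.fastq.gz',
]
print(duplicated_basenames(example_paths))
```

In this notebook it could be applied to `submissions_tidy['filename_fullpath']`, followed by an assertion that the returned list is empty.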


For the actual submission, we need a "wide" data frame with one row for each unique `sample_name` / `library_ID` and each of its files in a separate column.
The files should be listed by name only, without the full path.

First, look at how many files there are for each sample / library:

In [8]:
(submissions_tidy
 .groupby(['sample_name', 'library_ID'])
 .aggregate(n_files=pd.NamedAgg('filename_fullpath', 'count'))
 .sort_values('n_files', ascending=False)
 .reset_index()
 )

Unnamed: 0,sample_name,library_ID,n_files
0,expression_barcodes,lib1_SortSeq_bin1,10
1,expression_barcodes,lib2_SortSeq_bin1,8
2,hACE2_binding_barcodes,lib1_TiteSeq_12_bin1,6
3,hACE2_binding_barcodes,lib2_TiteSeq_16_bin1,6
4,hACE2_binding_barcodes,lib2_TiteSeq_13_bin1,6
...,...,...,...
133,hACE2_binding_barcodes,lib1_TiteSeq_10_bin4,2
134,hACE2_binding_barcodes,lib1_TiteSeq_10_bin3,2
135,hACE2_binding_barcodes,lib1_TiteSeq_10_bin2,2
136,hACE2_binding_barcodes,lib2_TiteSeq_16_bin4,2


Now make the wide submission data frame.
Note that we keep only the `filename` column, which holds the file names without the directory information:

In [9]:
submissions_wide = (
    submissions_tidy
    .assign(
        filename_count=lambda x: x.groupby(['sample_name', 'library_ID'])['filename'].cumcount() + 1,
        filename_col=lambda x: 'filename' + x['filename_count'].map(lambda c: str(c) if c > 1 else '')
        )
    .pivot(
        index='library_ID',
        columns='filename_col',
        values='filename',
        )
    )

submissions_wide = (
    submissions_tidy
    .drop(columns=['filename_fullpath', 'file_exists', 'filename'])
    .drop_duplicates()
    .merge(submissions_wide[natsort.natsorted(submissions_wide.columns)],
           on='library_ID',
           validate='one_to_one',
           )
    )

submissions_wide

Unnamed: 0,sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,...,filename,filename2,filename3,filename4,filename5,filename6,filename7,filename8,filename9,filename10
0,PacBio_CCSs,lib1_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,...,lib1_200415_A_ccs.fastq.gz,lib1_200415_B_ccs.fastq.gz,,,,,,,,
1,PacBio_CCSs,lib2_PacBio_CCSs,PacBio CCSs linking variants to barcodes for S...,Synthetic-Long-Read,SYNTHETIC,Restriction Digest,single,PACBIO_SMRT,PacBio Sequel,Restriction digest of plasmids carrying barcod...,...,lib2_200415_A_ccs.fastq.gz,lib2_200415_B_ccs.fastq.gz,,,,,,,,
2,expression_barcodes,lib1_SortSeq_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,...,200416_lib1_FITCbin1_TGGAACAA_L001_R1_001.fast...,200416_lib1_FITCbin1_TGGAACAA_L001_R1_002.fast...,200416_lib1_FITCbin1_TGGAACAA_L001_R1_003.fast...,200416_lib1_FITCbin1_TGGAACAA_L001_R1_004.fast...,200416_lib1_FITCbin1_TGGAACAA_L001_R1_005.fast...,200416_lib1_FITCbin1_TGGAACAA_L002_R1_001.fast...,200416_lib1_FITCbin1_TGGAACAA_L002_R1_002.fast...,200416_lib1_FITCbin1_TGGAACAA_L002_R1_003.fast...,200416_lib1_FITCbin1_TGGAACAA_L002_R1_004.fast...,200416_lib1_FITCbin1_TGGAACAA_L002_R1_005.fast...
3,expression_barcodes,lib1_SortSeq_bin2,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,...,200416_lib1_FITCbin2_TGGCTTCA_L001_R1_001.fast...,200416_lib1_FITCbin2_TGGCTTCA_L001_R1_002.fast...,200416_lib1_FITCbin2_TGGCTTCA_L002_R1_001.fast...,200416_lib1_FITCbin2_TGGCTTCA_L002_R1_002.fast...,,,,,,
4,expression_barcodes,lib1_SortSeq_bin3,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,...,200416_lib1_FITCbin3_TGGTGGTA_L001_R1_001.fast...,200416_lib1_FITCbin3_TGGTGGTA_L001_R1_002.fast...,200416_lib1_FITCbin3_TGGTGGTA_L002_R1_001.fast...,200416_lib1_FITCbin3_TGGTGGTA_L002_R1_002.fast...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133,hACE2_binding_barcodes,lib2_TiteSeq_15_bin4,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,...,200422_s15-b4_TAGGATGA_L001_R1_001.fastq.gz,200422_s15-b4_TAGGATGA_L002_R1_001.fastq.gz,,,,,,,,
134,hACE2_binding_barcodes,lib2_TiteSeq_16_bin1,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,...,200422_s16-b1_TATCAGCA_L001_R1_001.fastq.gz,200422_s16-b1_TATCAGCA_L001_R1_002.fastq.gz,200422_s16-b1_TATCAGCA_L001_R1_003.fastq.gz,200422_s16-b1_TATCAGCA_L002_R1_001.fastq.gz,200422_s16-b1_TATCAGCA_L002_R1_002.fastq.gz,200422_s16-b1_TATCAGCA_L002_R1_003.fastq.gz,,,,
135,hACE2_binding_barcodes,lib2_TiteSeq_16_bin2,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,...,200422_s16-b2_TCCGTCTA_L001_R1_001.fastq.gz,200422_s16-b2_TCCGTCTA_L002_R1_001.fastq.gz,,,,,,,,
136,hACE2_binding_barcodes,lib2_TiteSeq_16_bin3,SARS-CoV-2 RBD deep mutational scanning Illumi...,AMPLICON,SYNTHETIC,PCR,single,ILLUMINA,Illumina HiSeq 2500,PCR of barcodes from RBD variants,...,200422_s16-b3_TCTTCACA_L001_R1_001.fastq.gz,200422_s16-b3_TCTTCACA_L002_R1_001.fastq.gz,,,,,,,,
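
To see how the `cumcount` and `pivot` steps above produce the `filename`, `filename2`, ... columns, here is a small sketch on a made-up table (`toy` and its file names are hypothetical):

```python
import pandas as pd

# Toy tidy table: one row per file, as in `submissions_tidy`.
toy = pd.DataFrame({
    'library_ID': ['libA', 'libA', 'libA', 'libB'],
    'filename': ['a1.fastq.gz', 'a2.fastq.gz', 'a3.fastq.gz', 'b1.fastq.gz'],
})

toy_wide = (
    toy
    .assign(
        # number the files within each library: 1, 2, 3, ...
        filename_count=lambda x: x.groupby('library_ID').cumcount() + 1,
        # first file goes in "filename", later ones in "filename2", "filename3", ...
        filename_col=lambda x: 'filename' + x['filename_count'].map(lambda c: str(c) if c > 1 else ''),
    )
    # one row per library, one column per file slot; unused slots become NaN
    .pivot(index='library_ID', columns='filename_col', values='filename')
)

print(toy_wide)
```

With ten or more files, plain lexicographic sorting would place `filename10` before `filename2`, which is presumably why the columns are ordered with `natsort.natsorted` above.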


Now write the wide submissions data frame to a `*.tsv` file:

In [10]:
submissions_spreadsheet = 'SRA_submission_spreadsheet.tsv'

submissions_wide.to_csv(submissions_spreadsheet, sep='\t', index=False)

This submission sheet was then manually uploaded in Step 6 of the SRA submission wizard (*SRA metadata*).

## Upload the actual files
Step 7 of the SRA submission wizard is to upload the files.
In order to do this, we first make a `*.tar` file with all of the files.
Since this takes a long time, we only create the file if it doesn't already exist, so it is only created the first time this notebook is run.
**Note that this will cause a problem if you add more sequencing files to upload after running the notebook; in that case the cell below will need to be altered.**

In [11]:
tar_filename = 'SRA_submission.tar'

if os.path.isfile(tar_filename):
    print(f"{tar_filename} already exists, not creating it again")
else:
    try:
        with tarfile.open(tar_filename, mode='w') as f:
            for i, tup in enumerate(submissions_tidy.itertuples()):
                print(f"Adding file {i + 1} of {len(submissions_tidy)} to {tar_filename}")
                f.add(tup.filename_fullpath, arcname=tup.filename)
            print(f"Added all files to {tar_filename}")
    except:
        if os.path.isfile(tar_filename):
            os.remove(tar_filename)
        raise

SRA_submission.tar already exists, not creating it again


See the size of the `*.tar` file to upload and make sure it has the expected files:

In [12]:
print(f"The size of {tar_filename} is {os.path.getsize(tar_filename) / 1e9:.1f} GB")

with tarfile.open(tar_filename) as f:
    files_in_tar = set(f.getnames())
if files_in_tar == set(submissions_tidy['filename']):
    print(f"{tar_filename} contains all {len(files_in_tar)} expected files.")
else:
    raise ValueError(f"{tar_filename} does not have all the expected files.")

The size of SRA_submission.tar is 26.2 GB
SRA_submission.tar contains all 376 expected files.
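
One possible way to avoid the stale-archive problem noted above is to compare the names in any existing archive to the expected names, and rebuild when they differ. The sketch below is an assumption about how that could look, not what was run here; `tar_is_current` and `build_tar` are hypothetical helper names, and the check compares file names only, not file contents.

```python
import os
import tarfile
import tempfile


def tar_is_current(tar_path, expected_names):
    """Return True if `tar_path` exists and holds exactly `expected_names`."""
    if not os.path.isfile(tar_path):
        return False
    with tarfile.open(tar_path) as tar:
        return set(tar.getnames()) == set(expected_names)


def build_tar(tar_path, path_to_arcname):
    """Write a fresh archive, removing any partial file if writing fails."""
    try:
        with tarfile.open(tar_path, mode='w') as tar:
            for fullpath, arcname in path_to_arcname.items():
                tar.add(fullpath, arcname=arcname)
    except BaseException:
        if os.path.isfile(tar_path):
            os.remove(tar_path)
        raise


# Small demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    path_to_arcname = {}
    for name in ['a.txt', 'b.txt']:
        fullpath = os.path.join(tmpdir, name)
        with open(fullpath, 'w') as f:
            f.write(name)
        path_to_arcname[fullpath] = name
    demo_tar = os.path.join(tmpdir, 'demo.tar')

    build_tar(demo_tar, path_to_arcname)
    current_before = tar_is_current(demo_tar, path_to_arcname.values())
    # Pretend a new sequencing file was added to the expected list.
    current_after = tar_is_current(demo_tar, list(path_to_arcname.values()) + ['c.txt'])

print(current_before, current_after)
```

In the notebook, `path_to_arcname` could be built as `dict(zip(submissions_tidy['filename_fullpath'], submissions_tidy['filename']))`, calling `build_tar` only when `tar_is_current` returns `False`.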


The SRA instructions then give several ways to upload; we will do it using the FTP method.
First, specify the FTP address, username, password, and subfolder given by the SRA submission wizard instructions.
In order to avoid having the password be public here, that is in a separate text file that is **not** included in the GitHub repo (so this needs to be run in Jesse's directory that has this password):

In [13]:
# the following are provided by SRA wizard instructions
ftp_address = 'ftp-private.ncbi.nlm.nih.gov'
ftp_username = 'subftp'
ftp_account_folder = 'uploads/jbloom_fhcrc.org_IuMBgK44'
with open('ftp_password.txt') as f:
    ftp_password = f.read().strip()
    
# meaningful name for subfolder
ftp_subfolder = 'SARS-CoV-2-RBD_DMS'
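
If the notebook is run somewhere other than the directory holding `ftp_password.txt`, one alternative is to fall back on an environment variable. This is only a sketch: the helper `read_ftp_password` and the variable name `SRA_FTP_PASSWORD` are hypothetical and not part of the SRA instructions.

```python
import os


def read_ftp_password(password_file='ftp_password.txt', env_var='SRA_FTP_PASSWORD'):
    """Return the password from `env_var` if set, else from `password_file`."""
    password = os.environ.get(env_var)
    if password:
        return password.strip()
    if not os.path.isfile(password_file):
        raise FileNotFoundError(
            f"Set ${env_var} or create {password_file} (kept out of version control)."
        )
    with open(password_file) as f:
        return f.read().strip()
```

Either way, the password itself stays out of the notebook and out of the GitHub repo.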

Now create the FTP connection and upload the TAR file.
Note that this takes a while.
If you are worried that it will time out given the size of your file, you can run this notebook via `slurm` so that the session does not time out:

In [14]:
print(f"Starting upload at {datetime.datetime.now()}")

with ftplib.FTP(ftp_address) as ftp:
    ftp.login(user=ftp_username,
              passwd=ftp_password,
              )
    ftp.cwd(ftp_account_folder)
    ftp.mkd(ftp_subfolder)
    ftp.cwd(ftp_subfolder)
    with open(tar_filename, 'rb') as f:
        ftp.storbinary(f"STOR {tar_filename}", f)
        
print(f"Finished upload at {datetime.datetime.now()}")

Starting upload at 2020-06-16 18:07:42.589806
Finished upload at 2020-06-17 00:45:05.078738
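
Two things could go wrong with the upload cell above: `ftp.mkd` will typically raise an error if the subfolder already exists (for instance when re-running after an interrupted upload), and nothing confirms that the full file arrived. The following is a hedged sketch of how both could be handled with `ftplib`; the helper names are hypothetical, and whether the NCBI server supports the `SIZE` command used by `ftp.size` is an assumption that is not verified here.

```python
import ftplib
import os


def ensure_remote_dir(ftp, dirname):
    """Change into `dirname` on the server, creating it first if needed."""
    try:
        ftp.mkd(dirname)
    except ftplib.error_perm:
        # Assumption: a permanent error here means the folder already exists;
        # the cwd below still raises if the problem was something else.
        pass
    ftp.cwd(dirname)


def upload_and_check_size(ftp, local_path):
    """Upload `local_path` in binary mode and compare remote and local sizes."""
    remote_name = os.path.basename(local_path)
    with open(local_path, 'rb') as f:
        ftp.storbinary(f"STOR {remote_name}", f)
    ftp.voidcmd('TYPE I')  # switch to binary mode before asking for SIZE
    remote_size = ftp.size(remote_name)  # None if the server reports no size
    local_size = os.path.getsize(local_path)
    if remote_size is not None and remote_size != local_size:
        raise OSError(
            f"{remote_name}: remote size {remote_size} != local size {local_size}"
        )
    return remote_size
```

Inside the `with ftplib.FTP(ftp_address) as ftp:` block, `ensure_remote_dir(ftp, ftp_subfolder)` would stand in for the `mkd` / `cwd` pair and `upload_and_check_size(ftp, tar_filename)` for the `storbinary` call.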


Finally, I used the SRA wizard to select the `*.tar` archive and complete the submission.
Note that there is a warning about missing files since everything was uploaded as a `*.tar` rather than as individual files.
They should all be found when you hit the button to proceed and the `*.tar` is unpacked.

There was then a message that the submission was processing, and data would be released immediately upon processing.
The submission number is `SUB7594564`.