## Uploading the deep-sequencing data from 2B06 mutational antigenic profiling

This notebook describes how I am uploading the deep-sequencing data for the mutational antigenic profiling of 2B06 with Perth09 HA libraries.

Lauren Gentles, December 23, 2020

## Compile metadata on all the samples

First, I will make an outfile with metadata for each sample. 

In [58]:
# Import python modules
import glob
import os
import sys
import pandas

# Global variables
samples = pandas.read_csv('data/samplelist.csv')
samples['name'] = samples.selection + '-' + samples.library
metadata_f = 'metadata.txt'

# List of fastq directories
fastqdirs = ['/fh/fast/bloom_j/SR/ngs/illumina/bloom_lab/190524_M04866_0253_000000000-CGDFR_new-demux/Data/Intensities/BaseCalls/*/']

# Compile metadata on each sample, including names of all fastq files for that sample
with open(metadata_f, 'w') as f:
    # Write first line in metadata file
    first_line = 'bioproject_accession\tsample_name\tlibrary_ID\ttitle\tlibrary_strategy\tlibrary_source\tlibrary_selection\tlibrary_layout\tplatform\tinstrument_model\tdesign_description\tfiletype\tfilename\tfilename2\tfilename3\tfilename4\tfilename5\tfilename6\tfilename7\tfilename8\tfilename9\tfilename10\tfilename11\tfilename12\tfilename13\tfilename14\n'
    f.write(first_line.replace('\t', ','))
    
    # Make rest of entries in file
    all_fastq_files = []
    all_fastq_files.clear()
    for s in samples['name']:
        # Find all fastq files for a given sample
        print("\nGetting metadata for {0}. Here is a list of all fastq file basenames:".format(s))
        f1s = samples.loc[samples['name'] == s, 'R1']
        f2s = [f1.replace('_R1_', '_R2_') for f1 in f1s]
        for (f1, f2) in zip(f1s, f2s):
            assert os.path.isfile(f1) and os.path.isfile(f2)
            #assert '{0}'.format(s) in f1 and '{0}'.format(s) in f2
            print (os.path.basename(f1))
            print (os.path.basename(f2))
            
        # Append fastq files for sample to list of all fastq files
        all_fastq_files.extend(f1s+f2s)

        # Make entry in the metadata file for sample
        sample_line = ' \tPerth 2009 H3 library 2B06 profiling\t{0}\tdeep sequencing of library\tAMPLICON\tOTHER\tPCR\tpaired\tILLUMINA\tIllumina HiSeq 2500\tbarcoded-subamplicon sequencing\tfastq\t'.format(s) + '\t'.join(f1s+f2s) + '\n'
        f.write(sample_line.replace('\t', ','))


Getting metadata for 2B06-Lib.1. Here is a list of all fastq file basenames:
2B06-50ug_S1_L001_R1_001.fastq.gz
2B06-50ug_S1_L001_R2_001.fastq.gz

Getting metadata for Mock-Lib.1. Here is a list of all fastq file basenames:
Lib1-mock-rep2_S3_L001_R1_001.fastq.gz
Lib1-mock-rep2_S3_L001_R2_001.fastq.gz

Getting metadata for WSN-plasmid- . Here is a list of all fastq file basenames:
WSN-HA-plasmid_S2_L001_R1_001.fastq.gz
WSN-HA-plasmid_S2_L001_R2_001.fastq.gz

Getting metadata for 2B06-Lib.2. Here is a list of all fastq file basenames:
Lib2-2B06-25ug_S1_L001_R1_001.fastq.gz
Lib2-2B06-25ug_S1_L001_R2_001.fastq.gz

Getting metadata for Mock-Lib.2. Here is a list of all fastq file basenames:
Lib2-mock_S2_L001_R1_001.fastq.gz
Lib2-mock_S2_L001_R2_001.fastq.gz

Getting metadata for 2B06-Lib.3. Here is a list of all fastq file basenames:
Lib3-2B06-25ug_S3_L001_R1_001.fastq.gz
Lib3-2B06-25ug_S3_L001_R2_001.fastq.gz

Getting metadata for Mock-Lib.3. Here is a list of all fastq file basenames:
Lib

In [59]:
cat metadata.txt

bioproject_accession,sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename,filename2,filename3,filename4,filename5,filename6,filename7,filename8,filename9,filename10,filename11,filename12,filename13,filename14
 ,Perth 2009 H3 library 2B06 profiling,2B06-Lib.1,deep sequencing of library,AMPLICON,OTHER,PCR,paired,ILLUMINA,Illumina HiSeq 2500,barcoded-subamplicon sequencing,fastq,/fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/2B06-50ug_S1_L001_R1_001.fastq.gz/fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/2B06-50ug_S1_L001_R2_001.fastq.gz
 ,Perth 2009 H3 library 2B06 profiling,Mock-Lib.1,deep sequencing of library,AMPLICON,OTHER,PCR,paired,ILLUMINA,Illumina HiSeq 2500,barcoded-subamplicon sequencing,fastq,/fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000

## Pre-upload the fastq files

Then I make a `.tar` of the `.fastq` files for all samples

In [66]:
# Make a tar of all gzipped fastq files
tar_file_name = 'A_Perth_2009_MAP_2B06_fastq.tar'
tar_cmd = ' '.join([
                    'tar',
                    '-cvf',
                    tar_file_name] + all_fastq_files)

print("Making a tar file of FASTQ files with the following command:\n"+tar_cmd)
!$tar_cmd

Making a tar file of FASTQ files with the following command:
tar -cvf A_Perth_2009_MAP_2B06_fastq.tar /fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/2B06-50ug_S1_L001_R1_001.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/2B06-50ug_S1_L001_R2_001.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/Lib1-mock-rep2_S3_L001_R1_001.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/Lib1-mock-rep2_S3_L001_R2_001.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/WSN-HA-plasmid_S2_L001_R1_001.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/lgentles/190419_M03100_0418_000000000-CCW7R/Data/Intensities/BaseCalls/WSN-HA-plasmid_S2_L001_R2_001.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/bloom_lab/190524_

Next, I move the `.tar` file to its own subdirectory `fastq_files` so that I only transfer the fastq files with Aspera (see below).

In [68]:
!mkdir fastq_files
!mv A_Perth_2009_MAP_2B06_fastq.tar fastq_files/

mv: cannot stat 'A_Perth_2009_MAP_2B06_fastq.tar': No such file or directory
