## Uploading the deep-sequencing data from Nancy Hom's M1 analysis

This notebook describes how I am uploading the deep-sequencing data for the M1 amino-acid preference analysis.

Lauren Gentles, April-30-2018

## Compile metadata on all the samples

First, I will make an output file with metadata for each sample.

In [1]:
# Import python modules
import glob
import os
import sys

# Global variables
replicates = [1, 2, 3]
samples = ['DNA_Lib', 'DNA_WT', 'virus_WT', 'virus_Lib']
metadata_f = 'metadata.txt'

# List of fastq directories
fastqdirs = ['/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/*/']

# Compile metadata on each sample, including names of all fastq files for that sample
with open(metadata_f, 'w') as f:
    # Write the header line (tabs are converted to commas below, so the file is comma-separated)
    first_line = 'bioproject_accession\tsample_name\tlibrary_ID\ttitle\tlibrary_strategy\tlibrary_source\tlibrary_selection\tlibrary_layout\tplatform\tinstrument_model\tdesign_description\tfiletype\tfilename\tfilename2\tfilename3\tfilename4\tfilename5\tfilename6\tfilename7\tfilename8\tfilename9\tfilename10\tfilename11\tfilename12\tfilename13\tfilename14\n'
    f.write(first_line.replace('\t', ','))
    
    # Make rest of entries in file
    all_fastq_files = []
    for r in replicates:
        for s in samples:
            # Find all fastq files for a given sample
            print("\nGetting metadata for {0}-{1}. Here is a list of all fastq file basenames:".format(s,r))
            f1s = []
            for fastqdir in fastqdirs:
                if s in ['virus_WT', 'virus_Lib']:
                    f1s.extend(glob.glob('{0}/{1}{2}_*_R1_*.fastq*'.format(fastqdir, s, r)))
                else:
                    f1s.extend(glob.glob('{0}/*{1}{2}_*_R1_*.fastq*'.format(fastqdir, s, r)))
            f1s.sort()
            f2s = [f1.replace('_R1_', '_R2_') for f1 in f1s]
            for (f1, f2) in zip(f1s, f2s):
                assert os.path.isfile(f1) and os.path.isfile(f2)
                assert '{0}{1}'.format(s, r) in f1 and '{0}{1}'.format(s, r) in f2
                print(os.path.basename(f1))
                print(os.path.basename(f2))
            
            # Append fastq files for sample to list of all fastq files
            all_fastq_files.extend(f1s+f2s)

            # Make entry in the metadata file for sample
            sample_line = ' \tPR8 M1 libraries\t{0}-{1}\tdeep sequencing of library\tAMPLICON\tOTHER\tPCR\tpaired\tILLUMINA\tIllumina HiSeq 2500\tbarcoded-subamplicon sequencing\tfastq\t'.format(s,r) + '\t'.join(f1s+f2s) + '\n'
            f.write(sample_line.replace('\t', ','))


Getting metadata for DNA_Lib-1. Here is a list of all fastq file basenames:
DNA_Lib1_AGTCAA_L001_R1_001.fastq.gz
DNA_Lib1_AGTCAA_L001_R2_001.fastq.gz
DNA_Lib1_AGTCAA_L001_R1_002.fastq.gz
DNA_Lib1_AGTCAA_L001_R2_002.fastq.gz
DNA_Lib1_AGTCAA_L001_R1_003.fastq.gz
DNA_Lib1_AGTCAA_L001_R2_003.fastq.gz

Getting metadata for DNA_WT-1. Here is a list of all fastq file basenames:
DNA_WT1_ACAGTG_L001_R1_001.fastq.gz
DNA_WT1_ACAGTG_L001_R2_001.fastq.gz
DNA_WT1_ACAGTG_L001_R1_002.fastq.gz
DNA_WT1_ACAGTG_L001_R2_002.fastq.gz
DNA_WT1_ACAGTG_L001_R1_003.fastq.gz
DNA_WT1_ACAGTG_L001_R2_003.fastq.gz
DNA_WT1_ACAGTG_L001_R1_004.fastq.gz
DNA_WT1_ACAGTG_L001_R2_004.fastq.gz

Getting metadata for virus_WT-1. Here is a list of all fastq file basenames:
virus_WT1_GTGAAA_L001_R1_001.fastq.gz
virus_WT1_GTGAAA_L001_R2_001.fastq.gz
virus_WT1_GTGAAA_L001_R1_002.fastq.gz
virus_WT1_GTGAAA_L001_R2_002.fastq.gz
virus_WT1_GTGAAA_L001_R1_003.fastq.gz
virus_WT1_GTGAAA_L001_R2_003.fastq.gz

Getting metadata for virus_Lib
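
The cell above derives each R2 file name from its R1 file name with a string replacement on the full path. As a minimal sketch of the same pairing idea, the helper below swaps the read tag in the basename only, so a directory name that happens to contain `_R1_` is left alone. The function name `pair_reads` and the example paths are hypothetical and are not part of the original analysis.

```python
import os

def pair_reads(r1_paths):
    """Return sorted (R1, R2) path pairs, swapping the read tag in the basename only."""
    pairs = []
    for r1 in sorted(r1_paths):
        dirname, basename = os.path.split(r1)
        if basename.count('_R1_') != 1:
            raise ValueError('expected exactly one _R1_ in {0}'.format(basename))
        r2 = os.path.join(dirname, basename.replace('_R1_', '_R2_'))
        pairs.append((r1, r2))
    return pairs

# Made-up paths for illustration; the directory name deliberately contains '_R1_'
example = ['run_R1_test/DNA_Lib1_AGTCAA_L001_R1_002.fastq.gz',
           'run_R1_test/DNA_Lib1_AGTCAA_L001_R1_001.fastq.gz']
for r1, r2 in pair_reads(example):
    print(os.path.basename(r1), os.path.basename(r2))
```

Unlike the cell above, this sketch does not check that the files exist on disk; the `os.path.isfile` assertions would still be needed for that.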

In [2]:
cat metadata.txt

bioproject_accession,sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename,filename2,filename3,filename4,filename5,filename6,filename7,filename8,filename9,filename10,filename11,filename12,filename13,filename14
 ,PR8 M1 libraries,DNA_Lib-1,deep sequencing of library,AMPLICON,OTHER,PCR,paired,ILLUMINA,Illumina HiSeq 2500,barcoded-subamplicon sequencing,fastq,/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_001.fastq.gz,/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_002.fastq.gz,/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_003.fastq.gz,/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample
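
The header provides a fixed number of `filename` columns, so a sample with more fastq files than columns would overflow the table. The sketch below is one possible sanity check, assuming the file is comma-separated with a header containing a `filename` column, as written above. The function name `check_metadata` and the toy file are hypothetical.

```python
import csv
import os
import tempfile

def check_metadata(path):
    """Return a list of problems found in a comma-separated metadata file."""
    with open(path, newline='') as handle:
        rows = list(csv.reader(handle))
    if not rows:
        return ['file is empty']
    header, entries = rows[0], rows[1:]
    first_file_col = header.index('filename')  # first of the filename columns
    max_files = len(header) - first_file_col   # number of filename columns available
    problems = []
    for line_number, row in enumerate(entries, start=2):
        files = [field for field in row[first_file_col:] if field]
        if not files:
            problems.append('line {0}: no fastq files listed'.format(line_number))
        elif len(files) > max_files:
            problems.append('line {0}: {1} files but only {2} filename columns'.format(
                line_number, len(files), max_files))
    return problems

# Toy file with made-up names, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy = os.path.join(tmp, 'toy_metadata.txt')
    with open(toy, 'w') as f:
        f.write('sample_name,filetype,filename,filename2\n')
        f.write('a,fastq,a_R1.fastq.gz,a_R2.fastq.gz\n')
        f.write('b,fastq,b_R1.fastq.gz,b_R2.fastq.gz,b_extra.fastq.gz\n')
    for problem in check_metadata(toy):
        print(problem)
```

Calling `check_metadata('metadata.txt')` on the real file should return an empty list if every sample has at least one file and no more files than there are filename columns.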

## Pre-upload the fastq files

Then I make a `.tar` archive of the `.fastq` files for all samples.

In [3]:
# Make a tar of all gzipped fastq files
tar_file_name = 'A_PR_8_1934_M1_DMS_fastq.tar'
tar_cmd = ' '.join([
                    'tar',
                    '-cvf',
                    tar_file_name] + all_fastq_files)

print("Making a tar file of FASTQ files with the following command:\n"+tar_cmd)
!$tar_cmd

Making a tar file of FASTQ files with the following command:
tar -cvf A_PR_8_1934_M1_DMS_fastq.tar /fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_001.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_002.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_003.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R2_001.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R2_002.fastq.gz /fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R2_003.fastq.gz /fh/fast/bloom_j/SR/ngs/i

tar: Removing leading `/' from member names
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_001.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_002.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R1_003.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R2_001.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R2_002.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib1/DNA_Lib1_AGTCAA_L001_R2_003.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/P

/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib3/DNA_Lib3_ATGTCA_L001_R1_001.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib3/DNA_Lib3_ATGTCA_L001_R1_002.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib3/DNA_Lib3_ATGTCA_L001_R1_003.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib3/DNA_Lib3_ATGTCA_L001_R1_004.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib3/DNA_Lib3_ATGTCA_L001_R2_001.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib3/DNA_Lib3_ATGTCA_L001_R2_002.fastq.gz
/fh/fast/bloom_j/SR/ngs/illumina/mdoud/160722_D00300_0307_AHY27LBCXX/Unaligned/Project_mdoud/Sample_DNA_Lib3/DNA_Lib3_ATGTCA
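
Joining the file names into one shell command string works here because none of the paths contain spaces or other shell-special characters. As an alternative sketch, the standard-library `tarfile` module can build the same kind of uncompressed archive without going through the shell. The function name `make_tar` and the toy files are hypothetical, and this is not the command that was actually run above.

```python
import os
import tarfile
import tempfile

def make_tar(tar_path, files):
    """Write an uncompressed tar of `files` and return the archived member names."""
    missing = [f for f in files if not os.path.isfile(f)]
    if missing:
        raise FileNotFoundError('missing input files: {0}'.format(missing))
    with tarfile.open(tar_path, 'w') as tar:
        for f in files:
            tar.add(f)  # like the tar command, leading '/' is dropped from member names
    # Re-open the archive to confirm what was stored
    with tarfile.open(tar_path) as tar:
        return tar.getnames()

# Toy files with made-up names (one contains a space) in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy_files = []
    for name in ['toy 1_R1_001.fastq.gz', 'toy 1_R2_001.fastq.gz']:
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write('placeholder')
        toy_files.append(path)
    members = make_tar(os.path.join(tmp, 'toy.tar'), toy_files)
    print(len(members), 'files archived')
```

With the variables from the cell above, the call would be `make_tar(tar_file_name, all_fastq_files)`.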

Next, I move the `.tar` file to its own subdirectory `fastq_files` so that I only transfer the fastq files with Aspera (see below).

In [4]:
!mkdir -p fastq_files
!mv A_PR_8_1934_M1_DMS_fastq.tar fastq_files/
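
A Python equivalent of this step, as a minimal sketch: `os.makedirs(..., exist_ok=True)` does not fail when the directory already exists, so the cell can be re-run without an error. The function name `move_into_dir` and the toy file are hypothetical.

```python
import os
import shutil
import tempfile

def move_into_dir(src, dest_dir):
    """Create dest_dir if needed, move src into it, and return the new path."""
    os.makedirs(dest_dir, exist_ok=True)  # no error if the directory already exists
    return shutil.move(src, os.path.join(dest_dir, os.path.basename(src)))

# Toy example in a temporary directory with a made-up file name
with tempfile.TemporaryDirectory() as tmp:
    toy_tar = os.path.join(tmp, 'toy.tar')
    with open(toy_tar, 'w') as f:
        f.write('placeholder')
    new_path = move_into_dir(toy_tar, os.path.join(tmp, 'fastq_files'))
    print(os.path.basename(os.path.dirname(new_path)))
```

For the archive made above, the call would be `move_into_dir('A_PR_8_1934_M1_DMS_fastq.tar', 'fastq_files')`.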
