## Downloading FASTQ files from ENA
### What does this script do?
This script streams RNA-seq data (FASTQ files) from ENA and, while the data is streaming, quantifies RNA transcript abundance with kallisto.

### What you will require
1. A Google Account to save your data
2. A folder in your Google Drive containing<br>
  a) RunTable file<br>This file should contain the list of RunIDs that you want to download from ENA<br>
  b) CDS file of the organism that you want to work with
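
As a rough sketch of what the RunTable is expected to look like, assuming one RunID per line with an optional header line containing `RunID` (this mirrors the `get_ListOfRunID` function defined later in the script). The helper name `parse_run_table`, the second accession, and the skipping of blank lines are illustrative assumptions, not part of the script itself:

```python
def parse_run_table(text):
    """Return a list of upper-cased RunIDs, skipping blank lines and an optional header."""
    run_ids = [line.strip().upper() for line in text.splitlines() if line.strip()]
    if run_ids and "RUNID" in run_ids[0]:
        run_ids = run_ids[1:]  # drop the header line
    return run_ids

# Hypothetical RunTable contents; only SRR5387717 appears in this notebook's output.
example_table = "RunID\nSRR5387717\nsrr5387718\n\n"
print(parse_run_table(example_table))
```
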

### Expected outputs
1. Kallisto index of your organism's CDS<br>
2. Kallisto output folders<br>
3. Download report<br>Summarises the download status, the amount of data downloaded, the time taken for kallisto streaming and the kallisto statistics for each RunID
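
The download report written by this script is a tab-separated text file with the header `Run ID`, `Library Layout`, `Status`, `Kallisto time`, plus lines beginning with `#` that mark when a download attempt started, completed or resumed. A minimal sketch of reading such a file back, assuming that layout (the helper name `parse_download_report` and the sample row, including its timing value, are made up for illustration):

```python
def parse_download_report(text):
    """Split a download report into data rows (as dicts) and '#' comment lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    rows, comments = [], []
    for line in lines[1:]:
        if line.startswith("#"):
            comments.append(line)
        else:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows, comments

# Made-up report contents for illustration; the timing value is not a real measurement.
example_report = (
    "Run ID\tLibrary Layout\tStatus\tKallisto time\n"
    "#Download attempt 1 started\n"
    "SRR5387717\tPaired\tStreamed successfully\t123.4567\n"
    "#Download attempt 1 completed\n"
)
rows, comments = parse_download_report(example_report)
print(rows[0]["Status"])
```
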

In [13]:
#@title Mount Google Drive

#Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')
!rm -rf /content/sample_data

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [14]:
import os
import time
from datetime import datetime as dt
import scipy.stats as stats
import json
import math

In [15]:
#@title Input form {display-mode: "form"}

#@markdown Enter the species name in the format "_**Genus_species**_", e.g. '_**Nicotiana_tabacum**_'

#@markdown The folder name in your Google Drive main directory should also go by this name.

species_name = 'data' #@param {type: 'string'}

#@markdown ---

#@markdown File name for Run Table (with extension), stored in Google Drive species folder.

#@markdown Eg: '_**RunTable_Nicotiana_tabacum.txt**_'

RunTable_file = 'runid_Nta_short.txt' #@param {type: 'string'}

#@markdown ---

#@markdown File name for CDS file (with extension), stored in Google Drive species folder.

#@markdown Eg: '_**cds.selected_transcript.egu.fasta.gz**_'

cds_fasta_file = 'Nitab-v4.5_cDNA_Edwards2017.fasta' #@param {type: 'string'}

#@markdown ---

#@markdown Specify download mode

download_mode = "A. Start fresh run" #@param ["A. Start fresh run", "B. Continue from previous run"]
download_mode = download_mode[0]

if download_mode == "B":
  Date_initiated = '2020-01-20' #@param {type: 'date'}
  date = Date_initiated

In [16]:
#Create Dependencies directory
if os.path.exists('/content/Dependencies') == False:
  os.mkdir('/content/Dependencies')
  os.chdir('/content/Dependencies')
  print('Dependencies directory created.')
else:
  os.chdir('/content/Dependencies')
  print('Dependencies directory already present.')

#Download and install kallisto
if os.path.exists('kallisto_linux-v0.46.0.tar.gz') == False:
  os.system('wget \'https://github.com/pachterlab/kallisto/releases/download/v0.46.0/kallisto_linux-v0.46.0.tar.gz\'')
  if os.path.exists('kallisto_linux-v0.46.0.tar.gz') == True:
    print('kallisto .gz file downloaded.')
  else:
    print('kallisto .gz file download failed.')
else:
  print('kallisto .gz file already present.')

os.system('tar -xf kallisto_linux-v0.46.0.tar.gz')
if os.path.exists('kallisto/kallisto'):
  print('kallisto installed.')
  !cp kallisto/kallisto /bin/kallisto
else:
  print('kallisto not found.')

Dependencies directory already present.
kallisto .gz file already present.
kallisto installed.


In [17]:
#Define paths
working_dir_path = "/content/gdrive/My Drive/" + species_name + "/"
working_dir_path_ter = "/content/gdrive/My\\ Drive/" + species_name + "/"
  #Note: the space must be escaped as "\ " when the path is passed to a shell command as a string (e.g. via os.system)
RunTablePath = working_dir_path + RunTable_file
cds_fasta_path = working_dir_path_ter + cds_fasta_file
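
An alternative to escaping the space by hand is to let the standard-library function `shlex.quote` quote the whole path before it is placed in a shell command. A minimal sketch, assuming the same Drive layout as above; the `species_name`, file name and index name below are placeholder values, not the ones used in this run:

```python
import shlex

species_name = "Nicotiana_tabacum"  # placeholder value
cds_fasta_file = "cds.fasta"        # placeholder value

working_dir_path = "/content/gdrive/My Drive/" + species_name + "/"
cds_fasta_path = working_dir_path + cds_fasta_file

# shlex.quote wraps the path in single quotes so the shell treats it as one argument.
quoted_cds_fasta_path = shlex.quote(cds_fasta_path)
command = "kallisto index -i index_file " + quoted_cds_fasta_path
print(command)
```

With this approach a single path variable can serve both Python file operations (unquoted) and shell commands (quoted), instead of maintaining a separate `_ter` copy of each path.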

In [31]:
#Make new directory for this execution of the script if user chooses option "A"
os.chdir(working_dir_path)

if download_mode == "A":
  date = str(dt.now().date())
  files = os.listdir(working_dir_path)
  try:
      os.mkdir(working_dir_path + date + "_01")
      print(date + "_01 directory has been created.") 
  except FileExistsError:
      filename = max([filename for filename in files if date in filename])
      file_serial_int = int(filename[-2:]) + 1
      file_serial_str = str(file_serial_int).zfill(2) #zero-pad to two digits, e.g. 2 -> "02"
          
      os.mkdir(working_dir_path + date + "_" + file_serial_str)
      print(date + "_" + file_serial_str + " directory has been created.")
  except:
      print("Directory failed to be created.")

#Calls the most recent directory
#Will start from here for Option "B"
files = os.listdir(working_dir_path)
filename = max([filename for filename in files if date in filename])
execution_dir_path = working_dir_path + filename + "/"
execution_dir_path_ter = working_dir_path_ter + filename + "/"

2021-02-11_01 directory has been created.
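
The directory naming scheme used above (`<date>_NN`, with `NN` incremented for each new run on the same date) can be sketched as a small pure function. This is an illustration only: the helper name `next_run_dir_name` is hypothetical, and it assumes fewer than 100 runs per day so that the serial always fits in two digits.

```python
def next_run_dir_name(existing_names, date):
    """Return the next 'YYYY-MM-DD_NN' directory name for the given date."""
    serials = [int(name[-2:]) for name in existing_names
               if name.startswith(date + "_") and name[-2:].isdigit()]
    next_serial = max(serials, default=0) + 1
    return date + "_" + str(next_serial).zfill(2)

print(next_run_dir_name([], "2021-02-11"))
print(next_run_dir_name(["2021-02-11_01", "2021-02-11_02"], "2021-02-11"))
```
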


In [32]:
#Download report
#Create a tab-separated .txt logfile that stores time and progress in this workflow
os.chdir(execution_dir_path)
#species_name = (RunTablePath.split('/')[-1]).split('_',1)[1][:-4]
download_report_name = "Download_report_" + species_name + "_" + date + ".txt"

if os.path.exists(download_report_name):
  pass
  #For Option B, Download report will be read later in the for loop

else:
  #Create new download report
  with open(download_report_name, "a+") as download_report:
    download_report.write("Run ID\tLibrary Layout\tStatus\tKallisto time\n")

In [33]:
#Create kallisto index
kallisto_index_path_ter = execution_dir_path_ter + "index_file_" + species_name #to be created by kallisto
kallisto_index_path = execution_dir_path + "index_file_" + species_name

if os.path.exists(kallisto_index_path):
  print("Kallisto index already present for " + species_name + ".")
else:
  index_start = time.time()
  # os.system(kallisto_path + " index -i " + kallisto_index_path_ter + " " + cds_fasta_path)
  !kallisto index -i $kallisto_index_path_ter $cds_fasta_path
  if os.path.exists(kallisto_index_path):
    print("Kallisto index created for " + species_name + ".")
    print("Time to create kallisto index:", time.time()-index_start)
  else:
    print("Kallisto index not found for " + species_name + ".")


[build] loading fasta file /content/gdrive/My Drive/data/Nitab-v4.5_cDNA_Edwards2017.fasta
[build] k-mer length: 31
        from 1 target sequences
        with pseudorandom nucleotides
[build] counting k-mers ... tcmalloc: large alloc 1610612736 bytes == 0x6929e000 @  0x7fd1ee3341e7 0x6f181d 0x6f1899 0x4acad9 0x4a4ca8 0x4abe49 0x44e1d4 0x7fd1ed350bf7 0x452a59
done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 1316047 contigs and contains 61223759 k-mers 

Kallisto index created for data.
Time to create kallisto index: 234.4896912574768


In [34]:
#@title ####Download Functions#############################################################

def get_ftp_links(RunID):
  '''(str)->(str,str)
  Return ftp link in the paired and unpaired format for the RunID specified
  '''
  dir2 = ""
  if 9 < len(RunID) <= 12:
      dir2 = "0"*(12 - len(RunID)) + RunID[-(len(RunID)-9):] + "/"
  dirs = RunID[:6] + "/" + dir2 + RunID
  ftp_link_paired = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/" + dirs + "/" + RunID + "_1.fastq.gz"
  ftp_link_unpaired = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/" + dirs + "/" + RunID + ".fastq.gz"
  return ftp_link_paired, ftp_link_unpaired

######################################################################

def kallisto_stream(RunID):
  '''(str)->(float,str)
  Runs kallisto quant on the streamed FASTQ file for each RunID, streaming the unpaired file first, and then the paired file (read 1 only) if no reads were processed from the unpaired file.
  Streams only the first 1 GB (1x10^9 bytes) of data.
  Ensures that:
    Download speed does not stay below 1x10^6 bytes/s for 30 s
    Maximum time taken = 600 s = 10 min
  If these speed/time limits are not met, the download is terminated and restarted, for a total of 3 tries.
  '''
  RunID_file_path = execution_dir_path + RunID + "/"
  paired, unpaired = get_ftp_links(RunID)

  for i in range(3): #try downloading at most 3 times
    layout = "Layout unknown"
    kallisto_start = time.time()
    # Download first 1 GB (10^9 bytes), max time 600 s, minimum speed ~1 MB/s sustained over 30 s, send stderr to RunID.log
    !kallisto quant -i $kallisto_index_path_ter -o $RunID --single -l 200 -s 20 -t 2 <(curl -L -r 0-1000000000 -m 600 --speed-limit 1000000 --speed-time 30 $unpaired 2> $RunID'.log')
    layout = "Single"
    if '"n_processed": 0' in open(execution_dir_path + RunID+'/run_info.json','r').read(): #Check whether the unpaired mode processed no reads. If so, the run was most likely paired-end.
      !kallisto quant -i $kallisto_index_path_ter -o $RunID --single -l 200 -s 20 -t 2 <(curl -L -r 0-1000000000 -m 600 --speed-limit 1000000 --speed-time 30 $paired 2> $RunID'.log')
      layout = "Paired"
    kallisto_end = time.time()

    #Check the stderr log: if it contains no curl timeout error (28), the first 1 GB was downloaded within the limits
    if 'curl: (28)' not in open(RunID + '.log','r').read(): #if no slow download speed error, accept
      !rm $RunID".log"
      print(RunID + '[i=' + str(i) + ']: Download speed/time is acceptable.')
      break
    
    if RunID + '.log' in os.listdir(execution_dir_path): #remove stderr log file before continuing to the next download attempt
      !rm $RunID".log"
      print(RunID + '[i=' + str(i) + ']: Download speed/time is not accepted.')
  
  #If download still incomplete, use the last file saved
  kallisto_time = round(kallisto_end - kallisto_start, 4)
  
  return kallisto_time, layout

######################################################################

def get_ListOfRunID(RunTablePath):
  with open(RunTablePath,"r") as RunTable:
    ListOfRunID = [RunID.strip().upper() for RunID in RunTable.readlines()]
  if "RUNID" in ListOfRunID[0]: #If input file has header, exclude header
    ListOfRunID = ListOfRunID[1:]
  return ListOfRunID

######################################################################

def open_download_report():
  with open(download_report_name, "r") as download_report:
    download_lines = download_report.readlines()
  download_entries = [line.strip().split("\t") for line in download_lines]

  return download_lines, download_entries

######################################################################

def get_comments_index(download_lines):
  hex_line_indices = [index for index, line in enumerate(download_lines) if "#" in line] #enumerate avoids list.index() returning only the first of several identical comment lines
  started_indices = [index for index in hex_line_indices if "started" in download_lines[index]]
  completed_indices = [index for index in hex_line_indices if "completed" in download_lines[index]]

  return hex_line_indices, started_indices, completed_indices

######################################################################

def get_failed_RunID_B():
  '''
  * CALLED WHEN REDOWNLOADING (j=1 or 2) WITH MODE B *

  Opens Download Report;
  Collate failed RunIDs from the latest COMPLETED j loop.
  '''
  
  #Need to reopen download report to compile failed RunIDs
  download_lines, download_entries = open_download_report()
  hex_line_indices, started_indices, completed_indices = get_comments_index(download_lines)
  
  j_head_index = started_indices[-2] # If 1<=j<=2, then 2<=len(started_indices)<=3

  # Find failed RunIDs within last completed j loop, in chronological order
  list_of_failed_RunID = []
  for index in range(j_head_index, completed_indices[-1]):
    if index not in hex_line_indices and download_entries[index][2] == "0 reads processed":
      list_of_failed_RunID.append(download_entries[index][0])

  return list_of_failed_RunID

######################################################################

def get_failed_RunID_A():
  '''
  * CALLED WHEN MOVING ON TO THE NEXT j LOOP *

  Opens Download Report;
  Collate failed RunIDs from the latest COMPLETED j loop.
  '''
  
  #Need to reopen download report to compile failed RunIDs
  download_lines, download_entries = open_download_report()
  hex_line_indices, started_indices, completed_indices = get_comments_index(download_lines)
  
  j_head_index = started_indices[-1] # Start of the j loop that has just completed

  # Find failed RunIDs within last completed j loop, in chronological order
  list_of_failed_RunID = []
  for index in range(j_head_index, completed_indices[-1]):
    if index not in hex_line_indices and download_entries[index][2] == "0 reads processed":
      list_of_failed_RunID.append(download_entries[index][0])

  return list_of_failed_RunID

######################################################################

def get_j():
  '''
  Get the current j loop download was paused at.
  '''
  download_lines, download_entries = open_download_report()
  hex_line_indices, started_indices, completed_indices = get_comments_index(download_lines)
  j = len(completed_indices)
  return j

######################################################################

def get_RunID_start(RunID_queue):
  '''
  * CALLED ONCE WHEN REDOWNLOADING WITH MODE B ONLY *

  Checks from the bottom of Download Report upwards until the latest #start.
  Takes the latest RunID.
  RunID_start_index will be the index of the next RunID in RunID_queue.
  If all RunIDs in the queue are completed, the index will simply equal len(RunID_queue),
  and the script will move on to the next j loop.
  '''
  download_lines, download_entries = open_download_report()
  hex_line_indices, started_indices, completed_indices = get_comments_index(download_lines)

  for i in range((len(download_entries) - 1), (started_indices[-1]), -1):
    if i not in hex_line_indices:
      RunID_latest = download_entries[i][0]
      break

  try:
    RunID_start_index = RunID_queue.index(RunID_latest) + 1
  except:
    RunID_start_index = 0 # len(RunID_queue) will also be 0, move on to next j loop. 
    print("No RunID_start_index generated.")

  return RunID_start_index

######################################################################

def update_download_report(to_print):
  with open(download_report_name, "a+") as download_report:
    download_report.write(to_print)

######################################################################

def download_loop(RunID_start_index, RunID_queue):
  '''
  Execute inner download_loop i=3 for all RunIDs in RunID_queue;
  Updates download report as each RunID is processed.
  '''
  for index in range(RunID_start_index, len(RunID_queue)):
    RunID = RunID_queue[index]
    print()
    print("-"*40)
    print()
    job_queue = index + 1
    total_queue = len(RunID_queue)
    print('Processing ' + str(job_queue) + "/" + str(total_queue) + ": " + RunID)

    RunID_file_path = execution_dir_path + RunID + "/"
    if os.path.exists(RunID_file_path) == False:
      os.mkdir(RunID_file_path) #Directory to store kallisto files of each RunID
    download_status = [RunID, "N/A", "N/A", "N/A"]
    '''
    *Download Report Headers*
    RunID | Library Layout | Status | Kallisto time

    *Possible output for "Status"*
    download_status[2]
      "N/A" -> No kallisto output (run_info.json) found after streaming
      "Streamed successfully" -> Successfully streamed and quantified; the layout ("Single" or "Paired") is stored in download_status[1]
      "0 reads processed" -> kallisto processed no reads (e.g. streaming failed or was too slow)
    '''
    #Streaming method
    kallisto_time, layout = kallisto_stream(RunID)
    if os.path.exists(RunID_file_path + "run_info.json"):
      download_status[1] = layout
      download_status[3] = str(kallisto_time)
      if '"n_processed": 0' not in open(RunID_file_path + "run_info.json", "r").read():
        download_status[2] = "Streamed successfully"
        print(RunID + ": kallisto output downloaded by streaming. (" + layout + ")")
      else:
        download_status[2] = "0 reads processed"
        print(RunID + ": 0 reads processed")
    else:
      print(RunID + ": Missing kallisto output")

    update_download_report("\t".join(download_status) + "\n")

  return True
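
To make the directory layout assumed by `get_ftp_links` easier to follow, here is a standalone sketch of the same path logic: the first six characters of the accession form the top-level directory, and accessions longer than nine characters get an extra subdirectory built from the characters beyond the ninth, zero-padded to three. The helper name `ena_fastq_dir` and the accessions other than SRR5387717 are hypothetical and used only for illustration.

```python
def ena_fastq_dir(run_id):
    """Return the directory part of the ENA FASTQ path for a run accession."""
    sub_dir = ""
    if 9 < len(run_id) <= 12:
        extra_digits = run_id[9:]              # characters beyond the ninth
        sub_dir = extra_digits.zfill(3) + "/"  # zero-pad to three characters
    return run_id[:6] + "/" + sub_dir + run_id

base = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"
for run_id in ["SRR123456", "SRR5387717", "ERR12345678"]:
    print(base + ena_fastq_dir(run_id) + "/" + run_id + "_1.fastq.gz")
```
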

In [35]:
ListOfRunID = get_ListOfRunID(RunTablePath)

# Specify variables for mode A or B

if download_mode == "A":
  j, RunID_start_index = 0, 0
  RunID_queue = ListOfRunID

elif download_mode == "B":
  j = get_j()

  if j == 3:
    RunID_queue = [] #End download. j loop completed thrice.
  else:
    if j == 0:
      RunID_queue = ListOfRunID
    elif 1 <= j <= 2:
      RunID_queue = get_failed_RunID_B()
    if RunID_queue == []: # End j loop if no more failed RunID.
      j = 3
  
  RunID_start_index = get_RunID_start(RunID_queue)
  update_download_report("#Download resumed\n")

In [36]:
#@title Code for Download Loop

# Loop through j x i times down the RunID_queue

for loop in range(j,3):
  print("\n" + "-"*40 + "\n")
  if RunID_start_index == 0:
    print("Download attempt %s"%(loop+1))
    update_download_report("#Download attempt %s started\n" % (loop+1))
  else:
    print("Download attempt %s resumed"%(j+1))
  
  download_loop(RunID_start_index, RunID_queue)
  update_download_report("#Download attempt %s completed\n" % (loop+1))

  #Reset RunID_start_index and RunID_queue
  RunID_start_index = 0
  RunID_queue = get_failed_RunID_A()
  if RunID_queue == []: # End j loop if no more failed RunID.
      print("§§§§§§§§§§§§§§§§§§§§§§All RunIDs have been successfully downloaded.§§§§§§§§§§§§§§§§§§§§§§")
      break

print("Download complete.")


----------------------------------------

Download attempt 1

----------------------------------------

Processing 1/5: SRR5387717

[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 69,500
[index] number of k-mers: 61,223,759
tcmalloc: large alloc 1610612736 bytes == 0x29e4000 @  0x7fc5ce9c11e7 0x6f181d 0x6f1899 0x4acad9 0x4a6c50 0x44ec75 0x7fc5cd9ddbf7 0x452a59
[index] number of equivalence classes: 224,789
[quant] running in single-end mode
[quant] will process file 1: /dev/fd/63
[quant] finding pseudoalignments for the reads ... done
[quant] processed 0 reads, 0 reads pseudoaligned
[~warn] no reads pseudoaligned.
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds


[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 69,500
[index] number of k-mers: 61,223,75