In [1]:
from utils import ftp_upload
import yaml
import os
import pandas as pd

# Uploading your sequencing runs to the SRA

Before publishing your study, you must make the raw sequencing runs available. This notebook contains the instructions and code for uploading these `fastq` files to the [NCBI Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra).

The recommended way to organize your experiments on the SRA is as follows:

```
BioProject: Measles isolated from a case of Subacute Sclerosing Panencephalitis (SSPE)
|
|___BioSamples: Illumina runs for 15 brain tissue samples isolated from the autopsy of a patient with SSPE
	|
	|___SRA Experiments: Each individual tissue sample (paired-end)
```

This structure is in keeping with the scheme used in Tyler's and Allie's Yeast Display DMS projects, such as [this one](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA639956). You can see what this looks like for an individual run by following it from its [`BioProject`](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA639956) >> to its [`BioSample`](https://www.ncbi.nlm.nih.gov/biosample/19925005) >> and finally to the [`SRA Run`](https://www.ncbi.nlm.nih.gov/sra/SRX11291810[accn]) itself.


**Note:** *The instructions below are applicable whether you have a new variant library for which a* `BioProject` *doesn't exist, or whether you are uploading a new* `BioSample` *to an existing* `BioProject`.

---

<details>
  
  <summary>
      <h1>&#9758; Click to Create the BioProject</h1>
  </summary>
    
*If you haven't yet created a* `BioProject` *for this Variant Library, follow the instructions here.* 
    
First, go to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/) and log in. Once logged into the submission portal, click on **`BioProject`** and then click the blue **`New Submission`** button. 
    
When you click this button, you will be prompted to fill out the **`Submitter`** information; do so. After this, fill out the **`Project Type`** information: for **`Project Data Type`**, select `Raw sequence reads`, and for **`Sample scope`**, select `Synthetic`. Fill in the **`Target`** and **`General Info`** pages with applicable information describing your library. For **`BioSample`**, do not add anything; just hit **`Continue`**. 
    
After you've filled out each of the fields, check that the information is correct and submit the new BioProject. It can take a bit for the project to be added and get a BioProject Accession. 
  
</details>

<p><strong>If you haven't yet created a </strong><code>BioProject</code><strong> for your variant library, click the pointing finger ( &#9758; ) to see how to set this up.</strong></p>

In [3]:
# After you've created the BioProject,
# or if you already have a BioProject that you'd like to add to,
# put the correct Accession here.

BioProject_ID = "PRJNA1071684"

---
# Create the BioSample for your Barcodes

**Otherwise, if your** `BioProject` **already exists and you are uploading barcode runs, start with these instructions.**

First, go to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/) and log in.

Then under the **`Start a new submission`** banner, click **`BioSample`** to create a new BioSample. You are now on a page that has a blue **`New Submission`** button, which you should select. This will bring you to a page with some Submitter information; check that it is correct and then hit **`Continue`**. 

You will then be on the **`General Information`** page. Select when to release the submission to the public (*generally, immediately is OK*). Then click that you are uploading a `Single BioSample` and hit **`Continue`**.

You will now be at the **`Sample Type`** page, and you have to select the package that best describes the submission. Click `Pathogen` and `clinical, or host associated`. Then click **`Continue`**. 

Now, you will enter the sample attributes. For the sample name, provide a short name that describes the sample. Also provide the rest of the information by editing the first section of the [`config.yml`](config.yml) in this directory. **Run the cell below to check what you've set and will need to add to the SRA.** 

In [4]:
with open('config.yml') as f:
    config = yaml.safe_load(f)

for k,v in config['attributes'].items():
    print(f"{k}: '{v}'")

organism: 'SARS-CoV-2'
strain: 'hCoV-19/HongKong/VM20109236/2020'
collected_by: 'Hui-Ling Yen'
collection_date: '2020'
host: 'Syrian Hamster'
host_disease: 'COVID-19'
isolation_source: 'nasal'


Then hit **`Continue`**.

You will now be on the page to specify the `BioProject`. You should either be adding to an existing `BioProject` or have created a new one according to the instructions above. Enter the correct ID in the format of `PRJNAXXXXXX` as the **`Existing BioProject`** and hit **`Continue`**.

Finally, add a sample title. Then hit **`Continue`**, make sure everything looks correct, then hit **`Submit`**.

After a brief bit of processing, the `BioSample` submission should show up, along with a sample accession that will be in the format of `SAMNXXXXXXX`. Add this sample accession to [`config.yml`](config.yml) as the value for the `accession` key under the `runs` key.
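A quick format check can catch copy-and-paste mistakes before an accession goes into `config.yml`. The sketch below is not part of this repository: the helper name `check_accession` is hypothetical, and the patterns simply encode the `PRJNA...` and `SAMN...` formats described above.

```python
import re

# Assumed patterns, based only on the formats described in this notebook:
# BioProject accessions look like PRJNA + digits, BioSample like SAMN + digits.
ACCESSION_PATTERNS = {
    "BioProject": re.compile(r"PRJNA\d+"),
    "BioSample": re.compile(r"SAMN\d+"),
}


def check_accession(kind, accession):
    """Return True if `accession` matches the expected format for `kind`."""
    return ACCESSION_PATTERNS[kind].fullmatch(accession.strip()) is not None


print(check_accession("BioProject", "PRJNA1071684"))
print(check_accession("BioSample", "SAMN39709134"))
```

A check like this only validates the shape of the string; it does not confirm that the accession exists at NCBI.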

In [5]:
with open('config.yml') as f:
    config = yaml.safe_load(f)
print(f"The BioSample accession for these Illumina Barcode runs is: {config['runs']['accession']}")
print(f"The BioProject accession for these Illumina Barcode runs is: {BioProject_ID}")

The BioSample accession for these Illumina Barcode runs is: SAMN39709134
The BioProject accession for these Illumina Barcode runs is: PRJNA1071684


---
## Upload the sequencing data

Go back to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/) and log in again. This time, under the **`Start a new submission`** banner, click **`Sequence Read Archive`** to upload the actual sequencing data.

You are now on a page with a **`New submission`** button, which you should click. Check that the submitter information is correct, then click **`Continue`**.

You will now be at the **`General Information`** page. We are adding to an existing `BioProject`, so enter the correct `BioProject` accession as the **`Existing BioProject`**. Then for the question of whether you already registered a `BioSample`, also select yes. Then select a Release Date depending on whether you want to release the results immediately (*usually fine*) or at some future date. Then click **`Continue`**.

You will now be at a page that asks how you want to provide the `SRA Metadata`. Click to Upload a file. The next section describes how to create this table.

---
## Create SRA metadata submission table

To create the SRA metadata submission table for the Illumina Barcode runs, edit the values under the `runs` key in the [`config.yml`](config.yml) file in this directory. 

Most of the values should be self-explanatory. For the key `sample_id_columns`, however, choose the columns that should be concatenated to make a unique `library_ID` for each run in the submission. **The order of the columns matters**. 
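To make the role of `sample_id_columns` concrete, here is a minimal sketch of how an ordered list of columns could be joined into a `library_ID`. The helper `make_library_id`, the `_` separator, and the example rows are all assumptions for illustration; the code that builds the real table may differ.

```python
def make_library_id(row, columns, sep="_"):
    """Join the values of `columns`, in the given order, into one ID string."""
    return sep.join(str(row[col]) for col in columns)


# Hypothetical sample rows; in practice these come from your own sample sheet.
rows = [
    {"Animal": "A1", "DPI": 2, "Replicate": 1},
    {"Animal": "A2", "DPI": 2, "Replicate": 1},
]

# Stand-in for config['runs']['sample_id_columns']
sample_id_columns = ["Animal", "DPI", "Replicate"]

library_ids = [make_library_id(row, sample_id_columns) for row in rows]

# Each run needs its own ID, so fail early if the chosen columns do not separate the rows.
if len(set(library_ids)) != len(library_ids):
    raise ValueError("sample_id_columns do not produce unique library IDs")

print(library_ids)  # ['A1_2_1', 'A2_2_1']
```

Reordering `sample_id_columns` changes every ID (for example `['Replicate', 'DPI', 'Animal']` gives `1_2_A1`), which is why the order matters.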

After you've edited the [`config.yml`](config.yml) file, run the cell below to double-check that the values are correct. 

In [6]:
with open('config.yml') as f:
    config = yaml.safe_load(f)

print(*[f"{k}: {v}\n" for k,v in config['runs'].items() if k != "ftp_subfolder"])

accession: SAMN39709134
 sample_id_columns: ['Animal', 'Pair', 'Experiment', 'Condition', 'DPI', 'DPC', 'Replicate']
 title_prefix: Illumina sequencing of SARS-CoV-2 amlified RNA
 description: Illumina sequencing from SARS-CoV-2 infected Syrian Hamster nasal samples
 strategy: RNA-Seq
 source: VIRAL RNA
 selection: PCR
 layout: paired
 platform: ILLUMINA
 model: Illumina iSeq 100



If these values look correct, go ahead and make two files:

1. A file called `SRA_metadata.tsv` for the Illumina Barcode runs that you're including in this `BioSample` submission. This file should follow the format that SRA provides. 

2. A file called `file_paths.csv` that contains a `filename_fullpath` column and a `filename` column for each `fastq` file. 
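When filling in the metadata table for paired-end runs, each row needs both mates of a sample. The following is a rough sketch of pairing files by their shared prefix, assuming the `R1.fastq.gz` / `R2.fastq.gz` naming used in this notebook. The helper `pair_reads` and the `sample_prefix` key are hypothetical, and the `filename` / `filename2` keys are my assumption about the SRA template's column names, so check them against the template you download.

```python
from collections import defaultdict


def pair_reads(filenames):
    """Group R1/R2 fastq files that share a prefix into one row per sample."""
    pairs = defaultdict(dict)
    for name in filenames:
        for read in ("R1", "R2"):
            suffix = f"{read}.fastq.gz"
            if name.endswith(suffix):
                # Strip the read suffix and any trailing separator to get the shared prefix
                prefix = name[: -len(suffix)].rstrip("_.")
                pairs[prefix][read] = name

    rows = []
    for prefix in sorted(pairs):
        reads = pairs[prefix]
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"{prefix} is missing a mate: found {sorted(reads)}")
        rows.append({
            "sample_prefix": prefix,
            "filename": reads["R1"],
            "filename2": reads["R2"],
        })
    return rows


# Hypothetical filenames for illustration
example = [
    "tissue1_R1.fastq.gz",
    "tissue1_R2.fastq.gz",
    "tissue2_R2.fastq.gz",
    "tissue2_R1.fastq.gz",
]
for row in pair_reads(example):
    print(row)
```

Raising on an unpaired file is a deliberate choice here: a missing mate is easier to fix before the upload than after the submission is rejected.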

**Check that the table looks like you expect it to!**

In [7]:
path_to_sequencing_data = "../results/filtered/"


# Lists to store full paths and filenames
fullpaths = []
filenames = []

# Walk through the directory structure
for dirpath, dirnames, files in os.walk(path_to_sequencing_data):
    for filename in files:
        if filename.endswith(('R1.fastq.gz', 'R2.fastq.gz')):
            dirname = os.path.basename(dirpath)  # name of the directory that contains this file
            # NOTE: the absolute prefix below is hardcoded; update it to match your own project location
            fullpath = os.path.join("/fh/fast/bloom_j/computational_notebooks/whannon/2022/MeV_SSPE_Dynamics/results/filtered", dirname, filename)
            fullpaths.append(fullpath)
            filenames.append(filename)


# Create a DataFrame
df = pd.DataFrame({
    'filename_fullpath': fullpaths,
    'filename': filenames
})

# Write to a CSV file
df.to_csv('file_paths.csv', index=False)


In [8]:
SRA_metadata = pd.read_csv("SRA_metadata.tsv", sep="\t")
SRA_metadata.head()

FileNotFoundError: [Errno 2] No such file or directory: 'SRA_metadata.tsv'

---
## Upload the submission table

Now return to the [SRA Submission Portal](https://submit.ncbi.nlm.nih.gov/subs/sra/) webpage and, if needed, navigate back to your submission. You should still be at the **`SRA Metadata`** step and see a **`Choose file`** box; use it to upload the `SRA_metadata.tsv` file that you created in the step above. Then click **`Continue`**.

After a little while, you should now get a page (**`Files`**) that asks you how you want to upload the files for this submission. Click the option for **`FTP or Aspera Command Line file preload`**.

If you click on the **`+`** FTP upload instructions, you will see details. Add the `Username` and `Account folder` provided in these instructions to the [`config.yml`](config.yml) file as the values for the keys `ftp_username` and `ftp_account_folder` at the bottom of the file. Also add a value for the key `ftp_subfolder` that is meaningful for this particular submission (the cell below reads it as a top-level key of `config.yml`). Finally, put the FTP password as plain text in a file called `ftp_password.txt`, which is not tracked in this repo.

In [24]:
with open('config.yml') as f:
    config = yaml.safe_load(f)
    
print(f"""
ftp_username: {config['ftp_username']}
ftp_account_folder: {config['ftp_account_folder']}
ftp_subfolder: {config['ftp_subfolder']}
""")

if os.path.isfile('ftp_password.txt'):
    print("ftp_password.txt Exists!")
else:
    raise Exception("Make sure that ftp_password.txt exists in this directory.")



ftp_username: subftp
ftp_account_folder: uploads/wwh22_uw.edu_e2LIS4RJ
ftp_subfolder: MeV_SSPE_Sequencing

ftp_password.txt Exists!


---
## Upload the sequencing data

Now we need to upload the actual sequencing data. This is done by running the cells below. The first cell creates a very large `*.tar` file called `SRA_submission.tar` that contains all the `fastq` files listed in `file_paths.csv`. 

**Run the cell below to make this file. This can take a bit.**

In [27]:
ftp_upload.make_tar_file("file_paths.csv", "SRA_submission.tar")

Concatenating and Adding file 1 of 30 to SRA_submission.tar
Concatenating and Adding file 10 of 30 to SRA_submission.tar
Concatenating and Adding file 20 of 30 to SRA_submission.tar
Concatenating and Adding file 30 of 30 to SRA_submission.tar
Added all files to SRA_submission.tar

Removing tmp directory.

The size of SRA_submission.tar is 9.5 GB

SRA_submission.tar contains all 30 expected files.

Finished preparing the tar file for upload.
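For a sense of what the archive check involves, here is a minimal sketch using Python's standard `tarfile` module. It is not the implementation of `ftp_upload.make_tar_file` (which, judging by its log messages, also concatenates files); the helpers `make_tar` and `missing_from_tar` and the throwaway demo files are hypothetical.

```python
import os
import tarfile
import tempfile


def make_tar(paths_and_names, tar_path):
    """Write each (full path, archive name) pair into an uncompressed tar archive."""
    with tarfile.open(tar_path, "w") as tar:
        for fullpath, name in paths_and_names:
            tar.add(fullpath, arcname=name)


def missing_from_tar(tar_path, expected_names):
    """Return the expected names that are not present in the archive."""
    with tarfile.open(tar_path) as tar:
        present = set(tar.getnames())
    return sorted(set(expected_names) - present)


# Small demonstration with throwaway placeholder files
with tempfile.TemporaryDirectory() as tmp:
    files = []
    for name in ("a_R1.fastq.gz", "a_R2.fastq.gz"):
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(b"placeholder")
        files.append((path, name))

    demo_tar = os.path.join(tmp, "demo.tar")
    make_tar(files, demo_tar)

    # Ask for one name that was never added, to show how a gap is reported
    demo_missing = missing_from_tar(demo_tar, [n for _, n in files] + ["b_R1.fastq.gz"])
    demo_size = os.path.getsize(demo_tar)

print(demo_missing)  # ['b_R1.fastq.gz']
```

The same two checks, archive size and expected members, are what the output above reports for `SRA_submission.tar`.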


If the size of the `*.tar` file is what you expect, **run the cell below to use FTP to upload the file to the SRA.** This can take a long time to finish running, <strong style=color:red;>so if you have more than 5 GB of data to upload, do not run this.</strong>

In [9]:
# ftp_upload.upload_via_ftp(tar_path="SRA_submission.tar",
#                ftp_username=config['ftp_username'],
#                ftp_account_folder=config['ftp_account_folder'],
#                ftp_subfolder=config['ftp_subfolder'],
#                ftp_address='ftp-private.ncbi.nlm.nih.gov',
#                ftp_password='ftp_password.txt'
#               )

<strong style=color:red;>Instead, because this can take a very long time for large amounts of data, run this from the command line and submit to `slurm` as follows:</strong> 

```
sbatch --wrap "python utils/ftp_upload.py --config config.yml --sampletype illumina" --time 2-0
```

Once the transfer has finished, **manually log into the FTP site to see the file and use** `ls` **to check the size of what has been transferred and make sure that the upload worked correctly.** 
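The same size check can be scripted with Python's standard `ftplib`. This is only a sketch: the helpers `remote_size` and `check_upload` are hypothetical, and the remote folder layout (account folder followed by subfolder) is an assumption about where the upload script places the file.

```python
from ftplib import FTP


def remote_size(ftp, folder, filename):
    """Return the size in bytes of `filename` inside `folder` on the FTP server."""
    ftp.cwd(folder)
    ftp.voidcmd("TYPE I")  # binary mode; some servers only answer SIZE in binary mode
    return ftp.size(filename)


def check_upload(host, user, password, folder, filename, local_size):
    """Log in, look up the remote file size, and compare it with the local size."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        return remote_size(ftp, folder, filename) == local_size


# Example call (not executed here, since it needs network access and real credentials):
# check_upload(
#     host="ftp-private.ncbi.nlm.nih.gov",
#     user=config["ftp_username"],
#     password=open("ftp_password.txt").read().strip(),
#     folder=f"{config['ftp_account_folder']}/{config['ftp_subfolder']}",  # assumed layout
#     filename="SRA_submission.tar",
#     local_size=os.path.getsize("SRA_submission.tar"),
# )
```

A matching byte count is a reasonable sign that the transfer completed, though it is not a checksum.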

Finally, return to the SRA submission webpage for the reads, and check the blue **`Select preload folder`** box. Note that you need to wait about 10 minutes for the preload folder to become visible. Then click to select the folder you created (this is the `ftp_subfolder` defined in `config.yml`) and click **`Use selected folder`**. Finally, check the **`Autofinish submission`** box and hit **`Continue`**. You will get a warning that files are missing since you uploaded a `*.tar` archive; do not worry about this and just click **`Continue`**.

The webpage will then indicate it is extracting files from the `*.tar`, so wait for this to finish. It should then show that your submission is complete and just waiting for processing.

You will then probably want to delete the `SRA_submission.tar` file, as it is very large.

<h1 style=color:green;>&#127881; Congrats, you're done! &#127881;</h1>