# Generate a data table with data from Gen3 for use with the IGV in Terra

The [Integrative Genomics Viewer](http://software.broadinstitute.org/software/igv/) is an interactive visualization tool for large genomic files. The tool is available within Terra. This notebook leads you through some data wrangling steps to generate a data table that works with the [IGV tool in Terra](https://support.terra.bio/hc/en-us/articles/360029654831-Viewing-IGV-tracks-of-BAM-files-in-your-workspace-data) using data imported from Gen3. 

The final data table generated will look like this:

| IGV_Viewer_id | crai      | cram      |
|---------------|-----------|-----------|
| 0             | NWD1.crai | NWD1.cram |
| 1             | NWD2.crai | NWD2.cram |
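
As a rough sketch, a table like the one above could be written as a Terra load file, where the first header cell names the table as `entity:<table_name>_id`. The table name `IGV_Viewer`, the bucket name, and the file names below are placeholders chosen for illustration, not values taken from a real workspace.

```python
# Hypothetical rows: (row id, CRAI URL, CRAM URL). The bucket name is a placeholder.
rows = [
    ("0", "gs://example-bucket/NWD1.crai", "gs://example-bucket/NWD1.cram"),
    ("1", "gs://example-bucket/NWD2.crai", "gs://example-bucket/NWD2.cram"),
]

def build_load_file(rows, table_name="IGV_Viewer"):
    """Return TSV text whose first header cell is `entity:<table_name>_id`."""
    header = "\t".join([f"entity:{table_name}_id", "crai", "cram"])
    lines = [header] + ["\t".join(row) for row in rows]
    return "\n".join(lines) + "\n"

tsv = build_load_file(rows)
print(tsv)
```

A string built this way is the kind of input that the `upload_data_table` helper defined later in this notebook would pass on to `fiss.fapi.upload_entities`.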


Outline of steps in this notebook:
1. Transfer a project with your samples of interest from Gen3 to your Terra workspace using these [instructions](). The genomic data arrive in your data tables as DRS links, not as the files themselves.
2. Use the DRS tooling from the [terra_notebook_utils package](https://support.terra.bio/hc/en-us/articles/360039330211) to physically copy the genomic data of interest to your Terra workspace. This copy is needed because the IGV tool in Terra cannot resolve data through drs:// URLs. In this notebook, you provide the TOPMed NWD sequencing IDs to find the CRAM and CRAI files of interest. **Note:** you pay storage costs for any data you copy to your workspace, so you may want to delete these files when you are finished viewing them.
3. Generate a new data table, IGV_Viewer, where each row represents an individual and the columns contain links to the data in your workspace (gs://*.cram).
4. Navigate to the data section of your workspace and open the IGV_Viewer table. Follow the instructions in step 1 of this [document](https://support.terra.bio/hc/en-us/articles/360029654831-Viewing-IGV-tracks-of-BAM-files-in-your-workspace-data).
5. You may want to eventually delete the CRAM and CRAI files to avoid paying long-term storage costs. 

Install DRS packages

In [None]:
%pip install --upgrade --no-cache-dir terra-notebook-utils
%pip install --upgrade --no-cache-dir gs-chunked-io

Import the tooling and define some functions

In [None]:
import os
import terra_notebook_utils as tnu
from firecloud import fiss

def get_drs_urls(table_name):
    """
    Return a dictionary containing drs urls and file names, using sample as the key.                      
    """
    info = dict()
    for row in tnu.table.list_entities(table_name):                                                       
        drs_url = row['attributes']['pfb:object_id']
        file_name = row['attributes']['pfb:file_name']
        # Assume file names have the format `NWD244548.b38.irc.v1.cram`
        sample = file_name.split(".", 1)[0]
        info[sample] = dict(file_name=file_name, drs_url=drs_url)                                         
    return info

def upload_data_table(tsv):         
    billing_project = os.environ['GOOGLE_PROJECT']
    workspace = os.environ['WORKSPACE_NAME']
    resp = fiss.fapi.upload_entities(billing_project, workspace, tsv, model="flexible")
    resp.raise_for_status()
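
To illustrate how `get_drs_urls` keys its result, here is a minimal sketch that applies the same parsing to hand-written rows. The rows are hypothetical stand-ins shaped like the entries the function reads from `tnu.table.list_entities`; the DRS URLs are placeholders.

```python
# Hypothetical rows with the same attribute names that get_drs_urls reads
rows = [
    {"attributes": {"pfb:object_id": "drs://example/cram-1",
                    "pfb:file_name": "NWD244548.b38.irc.v1.cram"}},
    {"attributes": {"pfb:object_id": "drs://example/cram-2",
                    "pfb:file_name": "NWD596479.b38.irc.v1.cram"}},
]

info = {}
for row in rows:
    file_name = row["attributes"]["pfb:file_name"]
    drs_url = row["attributes"]["pfb:object_id"]
    # Everything before the first dot is treated as the sample ID
    sample = file_name.split(".", 1)[0]
    info[sample] = dict(file_name=file_name, drs_url=drs_url)

print(info["NWD244548"])
```

Note that the inner dictionary uses the keys `file_name` and `drs_url`, and that a second file for the same sample would overwrite the first entry.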

In [None]:
crams = get_drs_urls("submitted_aligned_reads")                                                           
crais = get_drs_urls("aligned_reads_index")

In [None]:
print(crams["NWD596479"])
print(crais["NWD596479"])

Check that `table.tsv` exists in the `Files` data directory

In [None]:
!gsutil ls $WORKSPACE_BUCKET/table.tsv

Extract samples from input table located in the `Files` Data directory

In [None]:
import os
from collections import defaultdict

BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE = os.environ['WORKSPACE_NAME']
bucket = os.environ['WORKSPACE_BUCKET']

# Copy the input table from the workspace bucket to the notebook's working directory
!gsutil cp $bucket/table.tsv .

# Map each sample name to a list of [var_id, var_range] pairs
samples = defaultdict(list)
with open("table.tsv", 'r') as test_table:
    for line in test_table:
        # The first three tab-separated columns are sample name, var_id, and var_range
        sample_name, var_id, var_range = line.rstrip('\n').split('\t')[:3]
        samples[sample_name].append([var_id, var_range.strip()])
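
For reference, the sketch below runs the same parsing logic on an in-memory example, so the expected shape of `table.tsv` and of `samples` is visible. The column contents (sample name, an identifier, a genomic range) are assumptions inferred from the variable names; the values are made up.

```python
import io
from collections import defaultdict

# Hypothetical contents of table.tsv: three tab-separated columns, no header row
example = (
    "NWD1\tvar1\tchr1:100-200\n"
    "NWD1\tvar2\tchr2:300-400\n"
    "NWD2\tvar3\tchr3:500-600\n"
)

samples = defaultdict(list)
for line in io.StringIO(example):
    sample_name, var_id, var_range = line.rstrip("\n").split("\t")[:3]
    samples[sample_name].append([var_id, var_range])

print(dict(samples))
```

A sample that appears on several lines collects one `[var_id, var_range]` pair per line.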


Check that this worked

In [None]:
print(samples.keys())
print(samples[list(samples.keys())[0]])

Copy the CRAM and CRAI files for the selected samples to the Terra workspace bucket. 

In [None]:
bucket = os.environ['WORKSPACE_BUCKET']
pfx = "test-crai-cram"
tsv_data = "\t".join(["cram_crai_id", "inputs", "output"])
for sample in samples.keys():
    cram = crams[sample]
    crai = crais[sample]
    # get_drs_urls stores the name under 'file_name', not 'pfb:file_name'
    tnu.drs.copy(cram['drs_url'], f"{bucket}/{pfx}/{cram['file_name']}")
    tnu.drs.copy(crai['drs_url'], f"{bucket}/{pfx}/{crai['file_name']}")
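
Before copying, it can help to see which destination URLs would be written and which requested samples have no CRAM or CRAI entry. The following is a minimal dry-run sketch under stated assumptions: `destination_url` is a hypothetical helper, and the lookups, bucket name, and DRS URLs are placeholders shaped like the output of `get_drs_urls`. It does not call `tnu.drs.copy`.

```python
def destination_url(bucket, prefix, file_name):
    """Join bucket, folder prefix, and file name into one gs:// URL."""
    return f"{bucket.rstrip('/')}/{prefix}/{file_name}"

# Hypothetical lookups shaped like the output of get_drs_urls
crams = {"NWD1": {"file_name": "NWD1.b38.irc.v1.cram", "drs_url": "drs://example/cram1"}}
crais = {"NWD1": {"file_name": "NWD1.b38.irc.v1.cram.crai", "drs_url": "drs://example/crai1"}}
requested = ["NWD1", "NWD2"]  # NWD2 is deliberately absent from the lookups

bucket = "gs://example-bucket"  # placeholder for os.environ['WORKSPACE_BUCKET']
pfx = "test-crai-cram"

# Samples that lack either file cannot be copied
missing = [s for s in requested if s not in crams or s not in crais]

copy_plan = []
for sample in requested:
    if sample in missing:
        continue
    for entry in (crams[sample], crais[sample]):
        copy_plan.append((entry["drs_url"], destination_url(bucket, pfx, entry["file_name"])))

print("missing:", missing)
for src, dst in copy_plan:
    print(src, "->", dst)
```

Each `(src, dst)` pair in `copy_plan` corresponds to one `tnu.drs.copy` call in the cell above.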

Create a table called `igv_table.tsv` that sets up the IGV WDL input with the URLs of all CRAM and CRAI files that were copied to the workspace

In [None]:
with open("igv_table.tsv", 'w') as igv_table:
    for sample in samples.keys():
        for var_id, var_range in samples[sample]:
            # get_drs_urls stores the name under 'file_name', not 'pfb:file_name'
            cram_name = crams[sample]['file_name'].strip('\n')
            crai_name = crais[sample]['file_name'].strip('\n')
            # Columns: sample, var_id, var_range, CRAM URL, CRAI URL
            igv_table.write(f"{sample}\t" +
                            f"{var_id.strip(chr(10))}\t" +
                            f"{var_range.strip(chr(10))}\t" +
                            f"{bucket}/{pfx}/{cram_name}\t" +
                            f"{bucket}/{pfx}/{crai_name}\n")

# Upload the table to the workspace bucket (`bucket` and `pfx` were set in the copy cell)
!gsutil cp igv_table.tsv $bucket/

Check that `igv_table.tsv` exists in the `Files` data directory

In [None]:
!gsutil ls $WORKSPACE_BUCKET/igv_table.tsv

When you are done viewing, delete the files you copied to avoid paying long-term storage costs. Deleting the data table does not delete the files in your bucket. You will need to navigate to the "Files" section of your workspace and delete the files in the "test-crai-cram" folder (the `pfx` value used when copying).

You can also delete files using `! gsutil rm` with the path to the files you want to delete. 
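
As a hedged sketch of the `gsutil rm` approach, the snippet below only builds and prints a cleanup command for everything under the copy folder, so nothing is deleted by running it. `build_cleanup_command` is a hypothetical helper, and the bucket name is a placeholder for the value of `WORKSPACE_BUCKET`.

```python
import shlex

def build_cleanup_command(bucket, prefix):
    """Return a gsutil command that recursively removes everything under bucket/prefix."""
    return ["gsutil", "-m", "rm", "-r", f"{bucket.rstrip('/')}/{prefix}"]

# Placeholder bucket; in the notebook this would be os.environ['WORKSPACE_BUCKET']
cmd = build_cleanup_command("gs://example-bucket", "test-crai-cram")

# Dry run: show the command for review instead of executing it
print(shlex.join(cmd))
```

After reviewing the printed path, the same command can be run in a notebook cell with a leading `!`, or from Python with `subprocess.run(cmd, check=True)`.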