# Create Data Tables For HiFi Data QC<a class="tocSkip">

**This notebook automatically reads in data stored in the AnVIL_HPRC workspace's bucket and generates data tables so the data can be run through QC (the NTSM and ReadStats WDLs).**

**Note that this notebook requires the following inputs**
1. Pedigree file: maps child ID to maternal and paternal IDs. Also used to pull the sample ID from the file key.

**Below are the steps taken in this notebook:**
1. Import Statements & Global Variable Definitions
2. Define Functions
3. Read In Sample Names
4. Create Dataframe Of Files
5. Write Data Frame To Data Tables

# Import Statements & Global Variable Definitions

## Installs

In [None]:
%%capture
## (%%capture is a cell magic, so it has to be the first line of the cell)
## For reading CSVs stored in Google Cloud (without downloading them first)
## May need to restart kernel after install
%pip install gcsfs

In [None]:
%%capture
## (%%capture is a cell magic, so it has to be the first line of the cell)
## For reading/writing data tables into pandas data frames
## May need to restart kernel after install
%pip install --upgrade --no-cache-dir --force-reinstall terra-pandas
%pip install --upgrade --no-cache-dir --force-reinstall git+https://github.com/DataBiosphere/terra-notebook-utils

## Import Statements

In [None]:
from firecloud import fiss
import pandas as pd         
import os                 
import subprocess       
import re                 
import io
import gcsfs

from typing import Any, Callable, List, Optional
from terra_notebook_utils import table, WORKSPACE_NAME, WORKSPACE_GOOGLE_PROJECT
from terra_pandas import dataframe_to_table, table_to_dataframe

## Global Variable Declarations

In [None]:
# AnVIL_HPRC WorkspaceBucket
anvil_hprc_bucket       = "gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/"

# Get the billing project name, workspace name, and bucket for the current workspace
PROJECT   = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.path.basename(os.path.dirname(os.getcwd()))
bucket    = os.environ['WORKSPACE_BUCKET'] + "/"


# Verify that we've captured the environment variables
print("Billing project: " + PROJECT)
print("Workspace: " + WORKSPACE)
print("Workspace storage bucket: " + bucket)

# Function Definitions

In [None]:
def rtn_datatype_ls_for_subm(bucket_url, submission_key, data_type_subdir, file_type_ls):
    '''Takes in:
            * bucket_url (string): url of bucket to search (should be the AnVIL_HPRC bucket)
                ex: "gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/"
            * submission_key (string): the UUID plus the submission name to search
                ex: "5c68b972-8534-402f-9861-11c93558765f--UW_HPRC_HiFi_Y3"
            * data_type_subdir (string): name of the subfolder to search 
              (used when a submission has more than one data type.)
                ex: "PacBio_HiFi" if the data is not in subfolders, pass "."
            * file_type_ls (list of strings): file extensions to search for. Often a submission will 
              have more than one type of file that represents the same dataset. Be careful to not 
              include replicate data, however.
                ex: ".hifi_reads.bam"
     Lists the submission's files in the bucket, then returns the filtered list of files.'''
    
    rtn_file_ls = []
    
    submission_path = str(bucket_url + "submissions/" + submission_key)
    file_list_byt   = subprocess.run(['gsutil', '-u', 'firecloud-cgl', 'ls', '-r', submission_path], 
                                     stdout=subprocess.PIPE)

    file_list_str   = file_list_byt.stdout.decode('utf-8')
    file_list       = file_list_str.split('\n')  ## Pull out "\n"
   
    ## filter out empty strings
    file_list = [ elem for elem in file_list if elem != '']
    
    ## Pull from subdir type we are targeting ("." means the data is not in subfolders)
    if data_type_subdir != ".":
        file_list = [f for f in file_list if data_type_subdir in f]
    
    for file_type in file_type_ls:
    
        ## Pull files of correct type (ex: .hifi_reads.bam); match the extension
        ## literally so that "." is not treated as a regex wildcard
        file_list_by_type = [f for f in file_list if f.endswith(file_type)]

        ## Add to list of files to return
        rtn_file_ls += file_list_by_type

    return rtn_file_ls    
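
The extension filter can be tried out without bucket access. The sketch below runs it on a hard-coded listing; the paths and the helper name `filter_by_extension` are invented for illustration and are not part of this notebook. It escapes each extension before using it as a pattern, because an unescaped `.` in a regular expression matches any character.

```python
import re

# Hypothetical `gsutil ls -r` output; these paths are invented for illustration.
listing = [
    "gs://example-bucket/submissions/abc--DEMO/:",
    "gs://example-bucket/submissions/abc--DEMO/HG00001/HG00001.hifi_reads.bam",
    "gs://example-bucket/submissions/abc--DEMO/HG00001/HG00001.hifi_reads.bam.pbi",
    "gs://example-bucket/submissions/abc--DEMO/HG00002/HG00002.hifi_reads.bam",
    "gs://example-bucket/submissions/abc--DEMO/HG00002/HG00002_hifi_readsXbam",
]

def filter_by_extension(paths, file_type_ls):
    """Keep paths that end with one of the given extensions (literal match)."""
    kept = []
    for file_type in file_type_ls:
        # re.escape makes "." match only a literal dot
        pattern = re.compile(re.escape(file_type) + r"$")
        kept += [p for p in paths if pattern.search(p)]
    return kept

bam_files = filter_by_extension(listing, [".hifi_reads.bam"])
print(bam_files)
```

Only the two `.hifi_reads.bam` paths are kept. The directory line, the `.pbi` index, and the look-alike name ending in `_hifi_readsXbam` are dropped.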

# Read In Sample Names

In [None]:
## This file should be stored in the bucket of the Terra workspace that is being used for 
## the submission you are wrangling.
sample_info_fp = "gs://fc-0de5e548-01c6-4195-a98b-ae7a1953688f/sample_info/UW_HPRC_HiFi_Y3.csv"

sample_df = pd.read_csv(sample_info_fp, header=None)

sample_df.rename(columns = {0:'sample_id'}, inplace = True)

In [None]:
## set the sample name to be the index (this is what we want each row to 
## use as a key -- of sorts -- in Terra)
sample_df.set_index('sample_id', inplace=True)
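
As a minimal sketch of what the two cells above do, the same steps can be run on an in-memory CSV. The sample IDs below are made up, and the file is assumed to hold one sample ID per line with no header row. The uniqueness check is an extra precaution not taken in the cells above, added because the index is meant to act as a row key.

```python
import io
import pandas as pd

# Stand-in for the sample info CSV: one (made-up) sample ID per line, no header
csv_text = "HG00001\nHG00002\nHG00003\n"

demo_df = pd.read_csv(io.StringIO(csv_text), header=None)
demo_df = demo_df.rename(columns={0: "sample_id"}).set_index("sample_id")

# The index is used as the row key, so duplicated IDs would be a problem
assert demo_df.index.is_unique, "duplicate sample IDs in sample info file"

print(demo_df.index.tolist())
```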

# Create Dataframe Of Files

## Find the data

In [None]:
subm_key         = "5c68b972-8534-402f-9861-11c93558765f--UW_HPRC_HiFi_Y3"
data_type_subdir = "."
file_type_ls     = [".hifi_reads.bam"]

In [None]:
## get a list of the files in submission (subm_key) that match the 
## data_type_subdir and file_type_ls
file_ls  = rtn_datatype_ls_for_subm(anvil_hprc_bucket, subm_key, 
                                      data_type_subdir, file_type_ls)

In [None]:
## Check that the number of files matches what we expect
len(file_ls)

## Add Each Sample's Data To Sample Data Frame

In [None]:
## Prepare data frame to hold data
sample_df['hifi'] = ''
sample_df['hifi']  = sample_df['hifi'].astype('object')  

In [None]:
for index, row in sample_df.iterrows():
    sample_id = row.name
    
    sample_file_ls = list(filter(lambda x:re.search(rf"{sample_id}", x), file_ls))
    sample_df.at[index, "hifi"] = sample_file_ls

In [None]:
## take a look to make sure it looks as expected...
sample_df
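
Beyond looking at the table by eye, two checks may be worth running: whether any sample ended up with no files, and whether any listed file was not assigned to a sample. The sketch below shows both on made-up sample IDs and paths; `demo_df` and `file_ls_demo` are stand-ins for `sample_df` and `file_ls`.

```python
import pandas as pd

# Made-up file listing and sample IDs, for illustration only
file_ls_demo = [
    "gs://example-bucket/HG00001/HG00001.hifi_reads.bam",
    "gs://example-bucket/HG00001/HG00001_run2.hifi_reads.bam",
    "gs://example-bucket/HG00003/HG00003.hifi_reads.bam",
]
demo_df = pd.DataFrame(index=pd.Index(["HG00001", "HG00002"], name="sample_id"))

# One list of matching files per sample (sample ID appears in the path)
demo_df["hifi"] = pd.Series(
    [[f for f in file_ls_demo if sid in f] for sid in demo_df.index],
    index=demo_df.index,
    dtype="object",
)

# Samples that matched no files at all
samples_without_files = demo_df.index[demo_df["hifi"].map(len) == 0].tolist()

# Files that were not picked up by any sample
assigned = {f for files in demo_df["hifi"] for f in files}
unassigned_files = [f for f in file_ls_demo if f not in assigned]

print("Samples without files:", samples_without_files)
print("Files without a sample:", unassigned_files)
```

Here `HG00002` has no files and the `HG00003` file belongs to no listed sample. Either result in a real run would suggest a mismatch between the sample list and the submission.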

## Add 1000G data (for NTSM run)

In [None]:
## Read in 1000G data from another workspace
## We will be using this to compare against our submission to make sure that
## the data comes from the same samples
thousand_genomes_df = table_to_dataframe("sample", 
                                        workspace="1000G-high-coverage-2019", 
                                        workspace_namespace="anvil-datastorage")

In [None]:
## Use the 1kg library name (e.g. HG00621) as the index (that matches our sample name)
thousand_genomes_df.set_index('library_name', inplace=True)

## We only need the cram file (represents the 30X Illumina dataset)
thousand_genomes_df = thousand_genomes_df[['cram']]

## name the column to be a bit more descriptive
thousand_genomes_df.rename(columns = {'cram':'1000g_cram'}, inplace = True)

In [None]:
## merge the two dataframes
ntsm_df = pd.merge(
    sample_df,
    thousand_genomes_df,
    left_index=True,
    right_index=True)
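
`pd.merge` defaults to an inner join, so any sample without a matching 1000G entry is dropped from `ntsm_df` without a warning. One way to see which samples would be dropped is a left join with `indicator=True`. The sketch below uses made-up data frames in place of `sample_df` and `thousand_genomes_df`.

```python
import pandas as pd

# Made-up stand-ins for sample_df and thousand_genomes_df
samples = pd.DataFrame(
    {"hifi": [["a.hifi_reads.bam"], ["b.hifi_reads.bam"]]},
    index=pd.Index(["HG00001", "HG00002"], name="sample_id"),
)
reference = pd.DataFrame(
    {"1000g_cram": ["gs://example-bucket/HG00001.cram"]},
    index=pd.Index(["HG00001"], name="library_name"),
)

# A left join keeps every sample; the _merge column records where each row matched
checked = pd.merge(
    samples,
    reference,
    left_index=True,
    right_index=True,
    how="left",
    indicator=True,
)

missing_from_reference = checked.index[checked["_merge"] == "left_only"].tolist()
print("Samples with no 1000G entry:", missing_from_reference)
```

In this made-up case `HG00002` has no 1000G entry, so the inner join would leave it out of the NTSM table.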

# Upload To Tables

In [None]:
## Create tables for running NTSM and ReadStats
dataframe_to_table("ntsm",      ntsm_df, WORKSPACE, PROJECT)
dataframe_to_table("readstats", sample_df, WORKSPACE, PROJECT)