# Update AnVIL_HPRC *sample* Data Table

**This notebook automatically reads data stored in the AnVIL_HPRC workspace and generates a data table with the relevant sequencing information.**

**Note that this notebook requires the following inputs:**
1. Pedigree file: maps each child ID to its maternal and paternal IDs. Also used to pull the sample ID from the file key.
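
As a rough illustration of how such a pedigree file can be used, the sketch below looks up the parents of a child sample. The two rows and the column names mirror the pedigree preview shown later in this notebook; the helper `parents_of` is hypothetical and not part of the notebook.

```python
import io
import pandas as pd

# Two rows copied from the pedigree preview in this notebook; the in-memory
# CSV stands in for the real pedigree file (an assumption for illustration).
pedigree_csv = io.StringIO(
    "seq_sample_id,paternal_id,maternal_id,cohort\n"
    "HG01891,HG01890,HG01889,HPRC\n"
    "HG02486,HG02484,HG02485,HPRC\n"
)

pedigree = pd.read_csv(pedigree_csv).rename(columns={"seq_sample_id": "sample_id"})
parents_by_child = pedigree.set_index("sample_id")[["maternal_id", "paternal_id"]]


def parents_of(child_id):
    """Return (maternal_id, paternal_id) for one child sample (hypothetical helper)."""
    row = parents_by_child.loc[child_id]
    return row["maternal_id"], row["paternal_id"]


print(parents_of("HG01891"))
```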

**Below are the steps taken in this notebook:**
1. Import Statements & Global Variable Definitions
2. Define Functions
3. Read In Input Files
4. Loop Through Samples & Compile Data
5. Write data frame to data table

# Import Statements & Global Variable Definitions

## Import Statements

In [7]:
%%capture
%pip install gcsfs

In [24]:
%%capture
%pip install --upgrade --no-cache-dir --force-reinstall terra-pandas
%pip install --upgrade --no-cache-dir --force-reinstall git+https://github.com/DataBiosphere/terra-notebook-utils

In [1]:
from firecloud import fiss
import pandas as pd
import os
import subprocess
import re
import io

from typing import Any, Callable, List, Optional
from terra_notebook_utils import table, WORKSPACE_NAME, WORKSPACE_GOOGLE_PROJECT
import terra_pandas as tp

## Global Variable Definitions

**Files To Read**

In [2]:
pedigree_fn    = "2021_02_17_anvil_hprc_pedigree.csv"

**Environment Variables**

In [3]:
# AnVIL_HPRC WorkspaceBucket
anvil_hprc_bucket       = "gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/"
current_workpace_bucket = "gs://fc-cba31066-5983-4306-b66e-bdcfb644fb32/"

# Get the Google billing project name and workspace name for current workspace
PROJECT = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.path.basename(os.path.dirname(os.getcwd()))
bucket = os.environ['WORKSPACE_BUCKET'] + "/"


# Verify that we've captured the environment variables
print("Billing project: " + PROJECT)
print("Workspace: " + WORKSPACE)
print("Workspace storage bucket: " + bucket)

Billing project: human-pangenome-ucsc
Workspace: AnVIL_HPRC_Data_Transfer
Workspace storage bucket: gs://fc-cba31066-5983-4306-b66e-bdcfb644fb32/


**Allow pandas to render really large strings without truncation**

In [4]:
## display.max_colwidth limits how many characters of a cell pandas renders.
## Currently set to ensure we can capture all Strandseq data
## Strandseq has 192 files/sample * ~200 characters/file = ~40,000 characters/sample!
pd.options.display.max_colwidth = 900000
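
The effect of this option can be seen in a small sketch. It assumes a toy one-column frame and uses `pd.option_context` so the setting is only changed temporarily; `display.expand_frame_repr` is switched off here purely so the console width does not interfere with the comparison.

```python
import pandas as pd

long_value = "x" * 100  # stand-in for a long, joined list of file paths
df = pd.DataFrame({"files": [long_value]})

# With a small limit, the rendered cell is cut short and ends in "..."
with pd.option_context("display.max_colwidth", 20,
                       "display.expand_frame_repr", False):
    short_repr = repr(df)

# With a large limit, the whole value is rendered
with pd.option_context("display.max_colwidth", 900000,
                       "display.expand_frame_repr", False):
    full_repr = repr(df)

print(long_value in short_repr, long_value in full_repr)

# The stored value itself is the same length in both cases
print(len(df.at[0, "files"]))
```

The option governs rendering only; the value held in the frame is not shortened.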

# Define Functions

**Pull List Of Files For Sample x File-Type**

In [5]:
def rtn_datatype_ls_for_sample(bucket_url, sample_id, data_type_subdir, file_type_ls):
    """Return files under a sample's raw_data subdir whose names end in one of file_type_ls."""

    ## Find correct subdir to look under
    ## (relies on the global hprc_samples_ls, which is built from the pedigree file)
    hprc_subdir = "HPRC_PLUS"
    if sample_id in hprc_samples_ls:
        hprc_subdir = "HPRC"

    rtn_file_ls = []

    dir_path = (bucket_url + "working/" + hprc_subdir + "/" +
                sample_id + "/raw_data/" + data_type_subdir)

    ## Recursively list the directory; -u bills the request to the firecloud-cgl project
    file_list_byt = subprocess.run(['gsutil', '-u', 'firecloud-cgl', 'ls', '-r', dir_path],
                                   stdout=subprocess.PIPE)

    file_list_str = file_list_byt.stdout.decode('utf-8')
    file_list     = file_list_str.split('\n')  ## One path per line

    ## filter out empty strings
    file_list = [elem for elem in file_list if elem != '']

    for file_type in file_type_ls:

        ## Pull files of correct type (ex: ccs.bam); escape the suffix so "." matches literally
        file_list_by_type = [x for x in file_list
                             if re.search(rf"{re.escape(file_type)}$", x)]

        ## Add to list of files to return
        rtn_file_ls += file_list_by_type

    return rtn_file_ls
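
The suffix-matching step can be tried without calling `gsutil`. The following is a minimal sketch, assuming a made-up listing shaped like `gsutil ls -r` output; `filter_by_suffix` and the `example-bucket` paths are hypothetical. This sketch escapes each suffix with `re.escape` so that dots are matched literally.

```python
import re


def filter_by_suffix(paths, suffixes):
    """Keep paths ending in any of the given suffixes, grouped in suffix order."""
    kept = []
    for suffix in suffixes:
        # Anchor at end of string; escape so "." is not a regex wildcard
        pattern = re.compile(re.escape(suffix) + r"$")
        kept += [p for p in paths if pattern.search(p)]
    return kept


# Hypothetical listing, including a directory header line and an index file
listing = [
    "gs://example-bucket/working/HPRC/HG01891/raw_data/PacBio_HiFi/:",
    "gs://example-bucket/working/HPRC/HG01891/raw_data/PacBio_HiFi/m1.ccs.bam",
    "gs://example-bucket/working/HPRC/HG01891/raw_data/PacBio_HiFi/m1.ccs.bam.pbi",
    "gs://example-bucket/working/HPRC/HG01891/raw_data/PacBio_HiFi/m2.fastq.gz",
]

result = filter_by_suffix(listing, ["ccs.bam", "fastq", "fastq.gz"])
for path in result:
    print(path)
```

Because each pattern is anchored at the end of the path, the `.pbi` index and the directory header line are dropped, and `fastq` does not also match `fastq.gz`.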

# Read In Input Files

**Pedigree File**

In [6]:
pedigree_fp = str(current_workpace_bucket + pedigree_fn)

pedigree_df = pd.read_csv(pedigree_fp)
pedigree_df.rename(columns = {'seq_sample_id':'sample_id'}, inplace = True)
pedigree_df.head()

Unnamed: 0,sample_id,paternal_id,maternal_id,cohort,notes
0,HG01891,HG01890,HG01889,HPRC,
1,HG02486,HG02484,HG02485,HPRC,
2,HG02559,HG02557,HG02558,HPRC,
3,HG01888,HG01882,HG01883,HPRC,Abnormal Karyotype
4,HG02257,HG02255,HG02256,HPRC,


**Get list of HPRC Samples (to split out from HPRC_PLUS)**

In [7]:
hprc_samples_ls = pedigree_df[pedigree_df['cohort'] == "HPRC"]['sample_id'].tolist()
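
To show how this list feeds into the directory lookup, here is a small sketch of the path construction under the layout used in this notebook (`working/<HPRC or HPRC_PLUS>/<sample>/raw_data/<data type>`). The function name `raw_data_dir`, the bucket URL, and the sample ID `SAMPLE_X` are hypothetical.

```python
def raw_data_dir(bucket_url, sample_id, data_type_subdir, hprc_samples):
    """Build the raw_data path for a sample (hypothetical helper)."""
    # Samples in the HPRC cohort live under HPRC/, everything else under HPRC_PLUS/
    subdir = "HPRC" if sample_id in hprc_samples else "HPRC_PLUS"
    return f"{bucket_url}working/{subdir}/{sample_id}/raw_data/{data_type_subdir}"


hprc_samples = {"HG01891", "HG02486"}  # a set gives constant-time membership checks
bucket = "gs://example-bucket/"        # placeholder bucket, not a real workspace

print(raw_data_dir(bucket, "HG01891", "hic", hprc_samples))
print(raw_data_dir(bucket, "SAMPLE_X", "hic", hprc_samples))
```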

# Loop Through Samples & Compile Data

In [8]:
## create data frame to hold all of our info for the data table
sample_info_df = pedigree_df.copy()

In [9]:
sample_info_df.head()

Unnamed: 0,sample_id,paternal_id,maternal_id,cohort,notes
0,HG01891,HG01890,HG01889,HPRC,
1,HG02486,HG02484,HG02485,HPRC,
2,HG02559,HG02557,HG02558,HPRC,
3,HG01888,HG01882,HG01883,HPRC,Abnormal Karyotype
4,HG02257,HG02255,HG02256,HPRC,


**Set columns to empty objects so Pandas will hold lists in cells**

In [10]:
## One empty object-dtype column per data type, so each cell can later hold a list
list_cols = ['hifi', 'hic', 'nanopore', 'bionano_cmap', 'bionano_bnx',
             'child_ilmn', 'mat_ilmn', 'pat_ilmn', 'strandseq']

for col in list_cols:
    sample_info_df[col] = ''
    sample_info_df[col] = sample_info_df[col].astype('object')
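
The reason for the object dtype can be illustrated on a toy frame: with an object column, `.at` can store a whole Python list in a single cell. This is a sketch only; the file paths are placeholders.

```python
import pandas as pd

df = pd.DataFrame({"sample_id": ["HG01891", "HG02486"]}).set_index("sample_id")

# Start from empty strings, then make the dtype explicitly 'object'
df["hifi"] = ""
df["hifi"] = df["hifi"].astype("object")

# Placeholder paths; a real run would use the gsutil listing
files = [
    "gs://example-bucket/a.ccs.bam",
    "gs://example-bucket/b.ccs.bam",
    "gs://example-bucket/c.ccs.bam",
]
df.at["HG01891", "hifi"] = files

print(type(df.at["HG01891", "hifi"]), len(df.at["HG01891", "hifi"]))
```

The other row keeps its empty string, so samples without files of a given type remain distinguishable from samples that have not been processed only if the loop assigns an empty list to them.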

In [11]:
sample_info_df.set_index('sample_id', inplace=True)

In [12]:
## Loop through rows (samples)
for index, row in sample_info_df.iterrows():
    
    sample_id = row.name
    print(f"Extract sample {index}")
    
    ## pull PacBio HiFi files
    hifi_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "PacBio_HiFi*", ["ccs.bam", "fastq", "fastq.gz"])
    sample_info_df.at[index, "hifi"] = hifi_file_ls

    ## pull Hi-C/OmniC files
    hic_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "hic", ["fastq.gz"])
    sample_info_df.at[index, "hic"] = hic_file_ls
    
    ## pull ONT fastq files (not fast5 signal files)
    ont_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "nanopore", ["fastq.gz"])
    sample_info_df.at[index, "nanopore"] = ont_file_ls

    ## pull bionano cmap files
    bnc_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "bionano", ["cmap"])
    sample_info_df.at[index, "bionano_cmap"] = bnc_file_ls

    ## pull bionano bnx files
    bnx_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "bionano", ["bnx.gz"])
    sample_info_df.at[index, "bionano_bnx"] = bnx_file_ls

    ## pull strandseq files
    strs_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "Strand_seq", ["fastq.gz", "txt.gz"])
    sample_info_df.at[index, "strandseq"] = strs_file_ls

    ## pull child Illumina data
    ilmn_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "Illumina/child", ["fastq.gz", "cram"])
    sample_info_df.at[index, "child_ilmn"] = ilmn_file_ls


    ## pull parents Illumina data
    par_file_ls = rtn_datatype_ls_for_sample(anvil_hprc_bucket, sample_id, "Illumina/parents", ["fastq.gz", "cram"])

    ## Set maternal
    mat_id      = sample_info_df[sample_info_df.index == sample_id]['maternal_id'].values[0]
    mat_file_ls = list(filter(lambda x:mat_id in x, par_file_ls))
    sample_info_df.at[index, "mat_ilmn"] = mat_file_ls

    ## Set paternal
    pat_id      = sample_info_df[sample_info_df.index == sample_id]['paternal_id'].values[0]
    pat_file_ls = list(filter(lambda x:pat_id in x, par_file_ls))
    sample_info_df.at[index, "pat_ilmn"] = pat_file_ls

Extract sample HG01891
Extract sample HG02486
Extract sample HG02559
Extract sample HG01888
Extract sample HG02257
Extract sample HG01138
Extract sample HG01358
Extract sample HG01123
Extract sample HG01361
Extract sample HG01258
Extract sample HG03516
Extract sample HG02572
Extract sample HG02886
Extract sample HG02717
Extract sample HG02630
Extract sample HG02622
Extract sample HG03540
Extract sample HG03453
Extract sample HG03579
Extract sample HG03471
Extract sample HG01978
Extract sample HG01928
Extract sample HG02148
Extract sample HG01998
Extract sample HG01952
Extract sample HG01106
Extract sample HG01175
Extract sample HG00741
Extract sample HG00735
Extract sample HG01071
Extract sample HG00480
Extract sample HG00621
Extract sample HG00438
Extract sample HG00673
Extract sample HG02723
Extract sample HG02818
Extract sample HG02970
Extract sample HG03486
Extract sample NA18906
Extract sample NA19030
Extract sample NA19240
Extract sample NA20129
Extract sample NA20300
Extract sam
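
The parental split in the loop above relies on each parent's ID appearing in the file path. A minimal sketch of that step, with hypothetical file names and a hypothetical helper `split_parent_files`:

```python
def split_parent_files(parent_files, maternal_id, paternal_id):
    """Assign each parental file to mother or father by ID substring match."""
    maternal = [f for f in parent_files if maternal_id in f]
    paternal = [f for f in parent_files if paternal_id in f]
    return maternal, paternal


# Placeholder paths; the IDs are the parents of HG01891 from the pedigree preview
parent_files = [
    "gs://example-bucket/working/HPRC/HG01891/raw_data/Illumina/parents/HG01889.final.cram",
    "gs://example-bucket/working/HPRC/HG01891/raw_data/Illumina/parents/HG01890.final.cram",
]

mat_files, pat_files = split_parent_files(parent_files, "HG01889", "HG01890")
print(mat_files)
print(pat_files)
```

Substring matching is simple, but it would mis-assign a file if a parent's ID happened to appear elsewhere in the path, for example in a directory name.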

# Write data frame to data table

## Upload To Current Workspace (Just To Check)

In [47]:
# sample_info_df = sample_info_df.rename(index={'sample_id': 'sample'})
# sample_info_df.index.names = ["sample"]

In [15]:
tp.dataframe_to_table("sample", sample_info_df, WORKSPACE, PROJECT)

## Upload To QC Workspace (If Necessary)

In [14]:
HPRC_QC_PROJECT   = "human-pangenome-ucsc"
HPRC_QC_WORKSPACE = "HPRC_QC"

In [16]:
tp.dataframe_to_table("y1_sample_update", sample_info_df, HPRC_QC_WORKSPACE, HPRC_QC_PROJECT)

## Upload To AnVIL_HPRC Workspace

In [36]:
ANVIL_HPRC_PROJECT   = "anvil-datastorage"
ANVIL_HPRC_WORKSPACE = "AnVIL_HPRC"

In [35]:
tp.dataframe_to_table("sample", sample_info_df, ANVIL_HPRC_WORKSPACE, ANVIL_HPRC_PROJECT)

NameError: name 'ANVIL_HPRC_WORKSPACE' is not defined