# Notebook 14: Pull Genbank Assemblies For Upload<a class="tocSkip">

**The final assemblies are available on GenBank. This notebook downloads them, copies them to this workspace's bucket, and creates a data table so that we can run a workflow to rename the sequence headers and remove the soft masking.**
    
    
**The steps that we will take are:**
1. Import Statements & Global Variable Definitions
2. Download NCBI Assemblies
3. Create Data Frame Of Assemblies
4. Copy Assemblies To Bucket
5. Upload To Data Table

# Import Statements & Global Variable Definitions

## Load Python packages
----

In [1]:
%%capture 
import terra_notebook_utils as tnu
import terra_pandas as tp
import os
import io
import gzip
import pandas as pd
import numpy as np

## Set Environment Variables

In [2]:
# Get the Google billing project name and workspace name
PROJECT = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.path.basename(os.path.dirname(os.getcwd()))
bucket = os.environ['WORKSPACE_BUCKET'] + "/"

# Verify that we've captured the environment variables
print("Billing project: " + PROJECT)
print("Workspace: " + WORKSPACE)
print("Workspace storage bucket: " + bucket)

Billing project: human-pangenome-ucsc
Workspace: HPRC_Reassembly
Workspace storage bucket: gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/


# Download NCBI Assemblies
## Install NCBI Datasets Tool

In [3]:
! wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets

--2021-06-21 02:31:36--  https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.229, 130.14.250.13, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13075642 (12M)
Saving to: ‘datasets’


2021-06-21 02:31:37 (44.4 MB/s) - ‘datasets’ saved [13075642/13075642]



In [4]:
! chmod +x datasets

## Pull Dehydrated Zip File
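The zip file contains all of the assemblies in dehydrated form, as downloaded from the NCBI Datasets page: it holds metadata and a list of files to fetch rather than the sequences themselves. Unzip it, then rehydrate it to download the sequence data.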

In [3]:
! mkdir -p final_assemblies
%cd final_assemblies

/home/jupyter-user/notebooks/HPRC_Reassembly/edit/final_assemblies


In [11]:
! gsutil cp gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/genbank_assemblies/2021_06_20_Year1_Genbank_Assemblies_Dehydrated.zip ./

Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/genbank_assemblies/2021_06_20_Year1_Genbank_Assemblies_Dehydrated.zip...
/ [1 files][ 24.1 KiB/ 24.1 KiB]                                                
Operation completed over 1 objects/24.1 KiB.                                     


In [19]:
! unzip 2021_06_20_Year1_Genbank_Assemblies_Dehydrated.zip -d y1_final_genbank

Archive:  2021_06_20_Year1_Genbank_Assemblies_Dehydrated.zip
  inflating: y1_final_genbank/README.md  
  inflating: y1_final_genbank/ncbi_dataset/data/data_summary.tsv  
  inflating: y1_final_genbank/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: y1_final_genbank/ncbi_dataset/fetch.txt  
  inflating: y1_final_genbank/ncbi_dataset/data/dataset_catalog.json  
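
The `fetch.txt` file listed above is what makes the archive "dehydrated": instead of sequence data, it holds a manifest of files to download. The sketch below is a minimal illustration of reading such a manifest to preview the download, assuming a BagIt-style layout of one `URL size path` entry per line. The helper name `parse_fetch_manifest` and the sample entries are hypothetical and are not taken from the real file.

```python
from typing import List, NamedTuple


class FetchEntry(NamedTuple):
    url: str
    size_bytes: int
    path: str


def parse_fetch_manifest(text: str) -> List[FetchEntry]:
    """Parse BagIt-style fetch lines: '<url> <size> <relative path>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two runs of whitespace only, so that a path
        # containing spaces stays intact.
        url, size, path = line.split(None, 2)
        # BagIt allows '-' when the size is unknown; treat it as 0 here.
        entries.append(FetchEntry(url, int(size) if size.isdigit() else 0, path))
    return entries


# Hypothetical manifest content, for illustration only.
example = (
    "https://example.org/a.fna\t100\tdata/GCA_000000001.1/a.fna\n"
    "https://example.org/b.fna\t250\tdata/GCA_000000002.1/b.fna\n"
)
entries = parse_fetch_manifest(example)
total_bytes = sum(e.size_bytes for e in entries)
print(f"{len(entries)} files, {total_bytes} bytes to download")
```

On the real data, the text would come from `y1_final_genbank/ncbi_dataset/fetch.txt`; checking the total size first gives a rough idea of how long rehydration will take.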


**Actually rehydrate the data**<br>
*Takes a few hours*

In [13]:
%%capture
! /home/jupyter-user/notebooks/HPRC_Reassembly/edit/datasets \
    rehydrate \
    --directory y1_final_genbank

In [4]:
# Print the working directory to confirm that the previous cell finished
! pwd

/home/jupyter-user/notebooks/HPRC_Reassembly/edit/final_assemblies
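
Printing the working directory only shows that the cell returned; it does not confirm that the FASTA files arrived. A minimal sketch of a stricter check follows, assuming each rehydrated assembly is a non-empty `.fna` file somewhere under the dataset directory. The helper `find_fasta_files` is hypothetical, and the demonstration runs on a throwaway temporary directory rather than the real data.

```python
import tempfile
from pathlib import Path
from typing import List


def find_fasta_files(root: Path) -> List[Path]:
    """Return all .fna files under root, raising if any of them is empty."""
    fasta_files = sorted(root.rglob("*.fna"))
    empty = [p for p in fasta_files if p.stat().st_size == 0]
    if empty:
        raise ValueError(f"{len(empty)} empty FASTA file(s), e.g. {empty[0]}")
    return fasta_files


# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "ncbi_dataset" / "data" / "GCA_000000001.1"
    data_dir.mkdir(parents=True)
    (data_dir / "example_genomic.fna").write_text(">contig1\nACGT\n")
    found = find_fasta_files(Path(tmp))
    n_found = len(found)
    print(f"Found {n_found} FASTA file(s)")
```

In this notebook the call would be `find_fasta_files(Path("y1_final_genbank"))`, and the resulting count can be compared against the number of assemblies expected.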


# Create Data Frame Of Assemblies

In [5]:
original_file_ls = ! find . -type f -name '*.fna' 
original_file_ls

['./y1_final_genbank/ncbi_dataset/data/GCA_018469695.1/GCA_018469695.1_HG01123.alt.pat.f1_v2.1_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018503275.1/GCA_018503275.1_NA19240.pri.mat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018504365.1/GCA_018504365.1_HG01109.pri.mat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018506155.1/GCA_018506155.1_HG03098.alt.pat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018470445.1/GCA_018470445.1_HG02572.pri.mat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018503245.1/GCA_018503245.1_HG03486.alt.pat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018473305.1/GCA_018473305.1_HG03453.alt.pat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018473295.1/GCA_018473295.1_HG03540.pri.mat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/data/GCA_018506975.1/GCA_018506975.1_HG00733.pri.mat.f1_v2_genomic.fna',
 './y1_final_genbank/ncbi_dataset/d

**Create Data Frame, and Pull The Sample Name And Haplotype From Original File Name**

In [6]:
assembly_df = pd.DataFrame(original_file_ls, columns=['original_fasta_loc']) 

In [7]:
file_name = assembly_df['original_fasta_loc'].str.split("/", expand=True)[5]

In [8]:
sample_haplotype = file_name.str.split("_", expand=True)[2]

In [9]:
assembly_df['sample'] = sample_haplotype.str.split(".", expand=True)[0]

In [10]:
assembly_df['haplotype'] = sample_haplotype.str.split(".", expand=True)[2]

In [11]:
assembly_df['mat_pat_int'] =  np.where(assembly_df['haplotype'] == "pat", 1, 2)
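
The positional splits above depend on every file name having the same number of `/`, `_`, and `.` separators. As a cross-check, the sketch below parses the same fields with a single regular expression and fails loudly on a name that does not match. The pattern is an assumption inferred from the file names listed earlier (`<accession>_<sample>.<pri|alt>.<mat|pat>.f1_<version>_genomic.fna`), and `parse_assembly_name` is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from names such as
# GCA_018469695.1_HG01123.alt.pat.f1_v2.1_genomic.fna (an assumption).
NAME_RE = re.compile(
    r"^(?P<accession>GCA_\d+\.\d+)_"
    r"(?P<sample>[A-Za-z0-9]+)\.(?:pri|alt)\.(?P<haplotype>mat|pat)\.f1_"
    r"v[\d.]+_genomic\.fna$"
)


def parse_assembly_name(path: str) -> dict:
    """Extract accession, sample, and haplotype from an assembly path."""
    name = PurePosixPath(path).name
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unexpected assembly file name: {name}")
    fields = match.groupdict()
    # Same encoding as the notebook: paternal = 1, maternal = 2.
    fields["mat_pat_int"] = 1 if fields["haplotype"] == "pat" else 2
    return fields


parsed = parse_assembly_name(
    "./y1_final_genbank/ncbi_dataset/data/GCA_018469695.1/"
    "GCA_018469695.1_HG01123.alt.pat.f1_v2.1_genomic.fna"
)
print(parsed)
```

Applying it to the whole column, for example with `assembly_df['original_fasta_loc'].map(parse_assembly_name)`, would raise on the first file name that breaks the expected layout instead of silently producing a wrong sample or haplotype.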

In [12]:
print(assembly_df)

                                   original_fasta_loc   sample haplotype  \
0   ./y1_final_genbank/ncbi_dataset/data/GCA_01846...  HG01123       pat   
1   ./y1_final_genbank/ncbi_dataset/data/GCA_01850...  NA19240       mat   
2   ./y1_final_genbank/ncbi_dataset/data/GCA_01850...  HG01109       mat   
3   ./y1_final_genbank/ncbi_dataset/data/GCA_01850...  HG03098       pat   
4   ./y1_final_genbank/ncbi_dataset/data/GCA_01847...  HG02572       mat   
..                                                ...      ...       ...   
89  ./y1_final_genbank/ncbi_dataset/data/GCA_01846...  HG01258       mat   
90  ./y1_final_genbank/ncbi_dataset/data/GCA_01846...  HG02257       mat   
91  ./y1_final_genbank/ncbi_dataset/data/GCA_01846...  HG01891       mat   
92  ./y1_final_genbank/ncbi_dataset/data/GCA_01847...  HG00673       pat   
93  ./y1_final_genbank/ncbi_dataset/data/GCA_01847...  HG01978       pat   

    mat_pat_int  
0             1  
1             2  
2             2  
3             1

# Copy Assemblies To Bucket

In [129]:
gcp_raw_assemblies_fp = f"{bucket}genbank_assemblies/raw"

In [132]:
! gsutil -m cp -r y1_final_genbank/ncbi_dataset/data/ {gcp_raw_assemblies_fp}

Copying file://y1_final_genbank/ncbi_dataset/data/assembly_data_report.jsonl [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/dataset_catalog.json [Content-Type=application/json]...
Copying file://y1_final_genbank/ncbi_dataset/data/data_summary.tsv [Content-Type=text/tab-separated-values]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018503275.1/GCA_018503275.1_NA19240.pri.mat.f1_v2_genomic.fna [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018469695.1/GCA_018469695.1_HG01123.alt.pat.f1_v2.1_genomic.fna [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018506155.1/GCA_018506155.1_HG03098.alt.pat.f1_v2_genomic.fna [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018503275.1/sequence_report.jsonl [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA

Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018471515.1/sequence_report.jsonl [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018503585.1/GCA_018503585.1_HG02818.pri.mat.f1_v2_genomic.fna [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018503585.1/sequence_report.jsonl [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018469945.1/GCA_018469945.1_HG02630.alt.pat.f1_v2_genomic.fna [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018469945.1/sequence_report.jsonl [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018472595.1/sequence_report.jsonl [Content-Type=application/octet-stream]...
Copying file://y1_final_genbank/ncbi_dataset/data/GCA_018472595.1/GCA_018472595.1_HG00438.alt.pat.f1_v2_genomic.fna [Content-Type=application/octet-stream]...

**Set location in GCP**

In [44]:
# Swap the local directory prefix for the bucket location used in the copy step above
assembly_df['original_fasta_gcp_loc'] = assembly_df['original_fasta_loc'].str.replace(
    './y1_final_genbank/ncbi_dataset/data/',
    f"{gcp_raw_assemblies_fp}/",
    regex=False)
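
Note that `str.replace` substitutes the pattern wherever it occurs in the string, not only at the start. A minimal sketch of a stricter, prefix-only mapping follows; `to_gcs_uri` is a hypothetical helper, and the bucket name in the example is a placeholder.

```python
def to_gcs_uri(local_path: str, local_root: str, gcs_prefix: str) -> str:
    """Swap a local directory prefix for a GCS prefix, or raise if absent."""
    if not local_path.startswith(local_root):
        raise ValueError(f"{local_path!r} is not under {local_root!r}")
    relative = local_path[len(local_root):].lstrip("/")
    return gcs_prefix.rstrip("/") + "/" + relative


uri = to_gcs_uri(
    "./y1_final_genbank/ncbi_dataset/data/GCA_018469695.1/"
    "GCA_018469695.1_HG01123.alt.pat.f1_v2.1_genomic.fna",
    local_root="./y1_final_genbank/ncbi_dataset/data/",
    gcs_prefix="gs://example-bucket/genbank_assemblies/raw",
)
print(uri)
```

The function can be applied to the column with `assembly_df['original_fasta_loc'].map(...)`, so that a path outside the expected directory raises an error rather than passing through unchanged.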

# Upload To Data Table

**The data frame is in long format (one row per haplotype). Split it into maternal and paternal data frames, then merge them on sample to get one row per sample.**

In [60]:
mat_df = assembly_df[assembly_df['mat_pat_int'] == 2]
pat_df = assembly_df[assembly_df['mat_pat_int'] == 1]

In [69]:
upload_df = mat_df.merge(pat_df, on='sample', suffixes=('_mat', '_pat'))

In [70]:
upload_df

| | original_fasta_loc_mat | sample | haplotype_mat | mat_pat_int_mat | original_fasta_gcp_loc_mat | original_fasta_loc_pat | haplotype_pat | mat_pat_int_pat | original_fasta_gcp_loc_pat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | NA19240 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 1 | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | HG01109 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 2 | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | HG02572 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 3 | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | HG03540 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 4 | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | HG00733 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 5 | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | HG03579 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 6 | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | HG005 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01850... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 7 | ./y1_final_genbank/ncbi_dataset/data/GCA_01846... | HG03516 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01846... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 8 | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | HG00621 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
| 9 | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | HG01175 | mat | 2 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... | ./y1_final_genbank/ncbi_dataset/data/GCA_01847... | pat | 1 | gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/g... |
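
An inner merge silently drops any sample that has only one haplotype, so it is worth confirming that every sample is paired. The sketch below shows one possible check on a toy data frame, using the `how`, `validate`, and `indicator` options of `DataFrame.merge`. The sample IDs and paths are made up for illustration.

```python
import pandas as pd

# Toy long-format table: S1 has both haplotypes, S2 has only a maternal one.
toy_df = pd.DataFrame({
    "sample": ["S1", "S1", "S2"],
    "haplotype": ["mat", "pat", "mat"],
    "original_fasta_gcp_loc": ["gs://b/S1.mat.fna", "gs://b/S1.pat.fna",
                               "gs://b/S2.mat.fna"],
})

toy_mat = toy_df[toy_df["haplotype"] == "mat"]
toy_pat = toy_df[toy_df["haplotype"] == "pat"]

# An outer merge keeps unpaired samples; validate raises if a sample has
# more than one row on either side; indicator records each row's origin.
paired = toy_mat.merge(toy_pat, on="sample", how="outer",
                       suffixes=("_mat", "_pat"),
                       validate="one_to_one", indicator=True)

unpaired = paired.loc[paired["_merge"] != "both", "sample"].tolist()
print("Samples missing a haplotype:", unpaired)
```

If the list of unpaired samples is empty on the real data, the inner merge used in this notebook loses nothing.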


**Get rid of unnecessary columns**

In [71]:
# Pass the names via columns=; the positional axis argument was removed in pandas 2.0
upload_df = upload_df.drop(columns=['original_fasta_loc_mat', 'original_fasta_loc_pat',
                                    'haplotype_mat', 'haplotype_pat',
                                    'mat_pat_int_mat', 'mat_pat_int_pat'])

**Upload To Data Table**

In [72]:
upload_df = upload_df.set_index('sample')

In [73]:
tp.dataframe_to_table("raw_genbank_sample", upload_df)
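
If the upload needs to be inspected or repeated by hand, the same table can be written as a Terra load file. The sketch below is an alternative route under stated assumptions, not a description of what `terra_pandas` does internally. It relies on Terra's convention that the first column header of a load file is `entity:<table name>_id`. The helper `to_terra_tsv` and the toy row are hypothetical.

```python
import pandas as pd


def to_terra_tsv(df: pd.DataFrame, table_name: str) -> str:
    """Render a data frame indexed by entity ID as a Terra load-file TSV."""
    out = df.copy()
    # Terra expects the ID column header to be 'entity:<table>_id'.
    out.index.name = f"entity:{table_name}_id"
    return out.to_csv(sep="\t")


toy_upload = pd.DataFrame(
    {
        "original_fasta_gcp_loc_mat": ["gs://example-bucket/S1.mat.fna"],
        "original_fasta_gcp_loc_pat": ["gs://example-bucket/S1.pat.fna"],
    },
    index=pd.Index(["S1"], name="sample"),
)
tsv_text = to_terra_tsv(toy_upload, "raw_genbank_sample")
print(tsv_text)
```

Writing `tsv_text` to a file gives a load file that can be reviewed before it is imported through the workspace's data tab.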