# Notebook 12: HG002 Contamination Fix For GenBank<a class="tocSkip"></a>

**HG002's maternal assembly needs one contig dropped and some additional adapter sequence masked. GenBank has asked us to make these changes ourselves (they don't mask adapters).**
    
    
**The steps that we will take are:**
1. Import Statements & Global Variable Definitions
2. Download HG002 Maternal Files
3. Drop EBV Contig
4. Mask Adapter Sequence
5. Fix Headers
6. Copy Final FASTA To Bucket

# Import Statements & Global Variable Definitions

## Load Python packages
----

In [1]:
%%capture 
import terra_notebook_utils as tnu
import terra_pandas as tp
import os
import io
import gzip
import pandas as pd
import numpy as np
from Bio import SeqIO
from Bio.Seq import Seq  # Bio.Alphabet was removed in Biopython 1.78, so it is not imported here

## Set Environment Variables

In [2]:
# Get the Google billing project name and workspace name
PROJECT = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.path.basename(os.path.dirname(os.getcwd()))
bucket = os.environ['WORKSPACE_BUCKET'] + "/"

# Verify that we've captured the environment variables
print("Billing project: " + PROJECT)
print("Workspace: " + WORKSPACE)
print("Workspace storage bucket: " + bucket)

Billing project: human-pangenome-ucsc
Workspace: HPRC_Reassembly
Workspace storage bucket: gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/


# Download HG002 Maternal Files



In [3]:
! mkdir HG002_Genbank
%cd HG002_Genbank

mkdir: cannot create directory ‘HG002_Genbank’: File exists
/home/jupyter-user/notebooks/HPRC_Reassembly/edit/HG002_Genbank


## Download partially fixed assembly + remaining contamination

In [8]:
! wget \
    https://submit.ncbi.nlm.nih.gov/api/2.0/files/8jtviypy/foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip/?format=attachment \
    --output-document foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip 

--2021-05-27 16:39:23--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/8jtviypy/foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 885141678 (844M) [application/zip]
Saving to: ‘foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip’


2021-05-27 16:39:46 (38.2 MB/s) - ‘foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip’ saved [885141678/885141678]



In [9]:
! wget \
    https://submit.ncbi.nlm.nih.gov/api/2.0/files/uuwykxej/remainingcontamination_hg002_maternal_f1_assembly_v2_1_genbank.txt/?format=attachment \
    --output-document remainingcontamination_hg002_maternal_f1_assembly_v2_1_genbank.txt

--2021-05-27 16:39:54--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/uuwykxej/remainingcontamination_hg002_maternal_f1_assembly_v2_1_genbank.txt/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4640 (4.5K) [text/plain]
Saving to: ‘remainingcontamination_hg002_maternal_f1_assembly_v2_1_genbank.txt’


2021-05-27 16:39:54 (4.28 MB/s) - ‘remainingcontamination_hg002_maternal_f1_assembly_v2_1_genbank.txt’ saved [4640/4640]



In [13]:
! unzip foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip

Archive:  foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip
  inflating: fasta/HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa  


In [15]:
! ls -lh fasta

total 2.9G
-rw-rw-r-- 1 jupyter-user users 2.9G May 11 11:40 HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa


## Take a look at what we need to fix

In [18]:
! tail -n 15 *.txt

(9 split spans with locations to mask/trim)

Trim:
Sequence name, length, span(s), apparent source
HG002#2#h2tg000001l	110635364	38657030..38657078	adaptor:NGB00972.1-not_cleaned
HG002#2#h2tg000004l	111658246	93366436..93366567	mitochondrion-not_cleaned
HG002#2#h2tg000037l	47503194	32469744..32470038	mitochondrion-not_cleaned
HG002#2#h2tg000051l	86491595	72699959..72700004	adaptor:NGB00972.1-not_cleaned
HG002#2#h2tg000077l	57225567	4930882..4931137	mitochondrion-not_cleaned
HG002#2#h2tg000115l	55119985	32266994..32267188	mitochondrion-not_cleaned
HG002#2#h2tg000128l	4104114	934938..935807	mitochondrion-not_cleaned
HG002#2#h2tg000180l	870869	64581..70422	mitochondrion-not_cleaned
HG002#2#h2tg000535l	45111	11417..45110	Human gammaherpesvirus 4-not_cleaned




**So we need to mask these regions:**<br>
HG002#2#h2tg000001l	110635364	38657030..38657078	adaptor:NGB00972.1-not_cleaned <br>
HG002#2#h2tg000051l	86491595	72699959..72700004	adaptor:NGB00972.1-not_cleaned <br>

**And we need to remove this contig:**<br>
HG002#2#h2tg000535l	45111	11417..45110	Human gammaherpesvirus 4-not_cleaned

*Note: the report only flags 11417..45110, but I have already BLASTed bases 1-10000 of this contig and found them to be EBV as well, so Shelby and I agreed to drop the entire contig.*
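
The triage above (mask the adaptor hits, drop the EBV contig, leave the mitochondrial hits untouched) can be written down as a small Python sketch. It assumes the rows of the "Trim" table are tab-separated with a single `start..end` span, as they appear in the `tail` output. `Hit`, `parse_trim_row` and `action_for` are hypothetical names, and the rule in `action_for` encodes only the decisions made in this notebook, not an NCBI policy.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    length: int
    start: int   # 1-based, inclusive (as written in the report)
    end: int     # 1-based, inclusive
    source: str


def parse_trim_row(row: str) -> Hit:
    """Parse one single-span row of the report's 'Trim' table."""
    name, length, span, source = row.rstrip("\n").split("\t")
    start, end = span.split("..")
    return Hit(name, int(length), int(start), int(end), source)


def action_for(hit: Hit) -> str:
    """Decision rule used in this notebook (an assumption, not an NCBI rule)."""
    if hit.source.startswith("adaptor:"):
        return "mask"
    if "gammaherpesvirus" in hit.source:
        return "drop"
    return "keep"


# Three rows copied from the report shown above.
rows = [
    "HG002#2#h2tg000001l\t110635364\t38657030..38657078\tadaptor:NGB00972.1-not_cleaned",
    "HG002#2#h2tg000004l\t111658246\t93366436..93366567\tmitochondrion-not_cleaned",
    "HG002#2#h2tg000535l\t45111\t11417..45110\tHuman gammaherpesvirus 4-not_cleaned",
]
hits = [parse_trim_row(r) for r in rows]
for hit in hits:
    print(hit.name, action_for(hit))
```

Rows with several comma-separated spans would need an extra split before `parse_trim_row` could handle them.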

# Drop EBV Contig
**HG002#2#h2tg000535l**

In [5]:
%cd fasta

/home/jupyter-user/notebooks/HPRC_Reassembly/edit/HG002_Genbank/fasta


## Drop Contig

In [25]:
! samtools faidx HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa

In [27]:
! mkdir dropped

In [29]:
! cat HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa.fai \
    | cut -f1 | grep -v 'HG002#2#h2tg000535l' \
    > dropped/HG002.mat.goodcontigs.txt

In [36]:
## Check that it worked
! cat dropped/HG002.mat.goodcontigs.txt | wc -l
! cat HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa.fai | wc -l

445
446
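
The name list keeps every contig except the EBV one. As an illustration, the same filter can be applied directly to FASTA lines in pure Python. This is a minimal sketch, not the method used in this notebook: `drop_records` is a hypothetical helper, and the demo records are made up. One difference from `samtools faidx` is that this version passes header lines through unchanged, so any description text after the sequence name is preserved.

```python
import io
from typing import Iterable, Iterator, Set


def drop_records(lines: Iterable[str], drop_ids: Set[str]) -> Iterator[str]:
    """Yield FASTA lines, skipping every record whose ID is in drop_ids.

    The ID is the first whitespace-delimited token after '>'.
    """
    keep = True
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = not fields or fields[0] not in drop_ids
        if keep:
            yield line


# Toy input standing in for the assembly FASTA.
demo = io.StringIO(
    ">lcl|ctgA Homo sapiens\nACGT\nAC\n"
    ">lcl|ctgB Homo sapiens\nGGGG\n"
    ">lcl|ctgC Homo sapiens\nTT\n"
)
kept = "".join(drop_records(demo, {"lcl|ctgB"}))
print(kept)
```

On the real file the input would be an open file handle and `drop_ids` would be `{"lcl|HG002#2#h2tg000535l"}`.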


Actually drop the contig

In [37]:
! samtools faidx \
    HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa \
    `cat dropped/HG002.mat.goodcontigs.txt` \
    > dropped/HG002.mat.dropped.fa

## Check Our Work

In [38]:
! grep -v ">" HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa | wc | awk '{print $3-$1}'

3.06065e+09


In [39]:
! grep -v ">" dropped/HG002.mat.dropped.fa | wc | awk '{print $3-$1}'

3.06061e+09


In [40]:
! samtools faidx dropped/HG002.mat.dropped.fa

In [41]:
! cat dropped/HG002.mat.dropped.fa.fai | wc -l

445


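**Size and number of contigs look good:** the index goes from 446 to 445 entries, and the totals differ by roughly 4 × 10^4 bases at the precision `awk` prints, which is consistent with removing the 45,111 bp contig.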
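
Because `awk` prints these totals in scientific notation, the before/after comparison is only good to about five significant figures. A minimal Python sketch for exact counts follows; `fasta_totals` is a hypothetical helper and the two inputs are toy data, not the assembly.

```python
import io
from typing import Iterable, Tuple


def fasta_totals(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total sequence characters), ignoring newlines."""
    n_records = 0
    n_bases = 0
    for line in lines:
        if line.startswith(">"):
            n_records += 1
        else:
            n_bases += len(line.rstrip("\r\n"))
    return n_records, n_bases


# Toy "before" and "after" files: record b (4 bases) has been dropped.
before = io.StringIO(">a\nACGT\nAC\n>b\nGGGG\n")
after = io.StringIO(">a\nACGT\nAC\n")
rec_before, bases_before = fasta_totals(before)
rec_after, bases_after = fasta_totals(after)
print(rec_before - rec_after, bases_before - bases_after)
```

Run on the original and dropped FASTA files, the two differences would be expected to be 1 record and 45111 bases if only `HG002#2#h2tg000535l` was removed.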

# Mask Adapter Sequence
## Pull Adapter Sequence To Check
HG002#2#h2tg000001l 110635364 38657030..38657078 adaptor:NGB00972.1-not_cleaned <br>
HG002#2#h2tg000051l 86491595 72699959..72700004 adaptor:NGB00972.1-not_cleaned <br>

In [50]:
! samtools faidx \
    dropped/HG002.mat.dropped.fa \
    'lcl|HG002#2#h2tg000001l:38657030-38657078'

>lcl|HG002#2#h2tg000001l:38657030-38657078
ATCTCTCTCAAACAACAACAACGGGAGGAGAGGAAAAGAGAGAGATAAC


In [52]:
## Compare against SMRTbell Dimer

# TCTCTCAACAACAACAACGGAGG-AGGAGGAAAAGAGAGAGATATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT
# ATCTCTCTCAAACAACAACAACGGGAGGAGAGGAAAAGAGAGAGATAAC

In [51]:
! samtools faidx \
    dropped/HG002.mat.dropped.fa \
    'lcl|HG002#2#h2tg000051l:72699959-72700004'

>lcl|HG002#2#h2tg000051l:72699959-72700004
ATCTCTCTCAACAACATCAACGGAGGAGGAGGGAAAAGAGAGAGAT


In [None]:
## Compare against SMRTbell dimer

#  TCTCT--CAACAACAACAACGGAGGAGGAGG-AAAAGAGAGAGATATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT
# ATCTCTCTCAACAACATCAACGGAGGAGGAGGGAAAAGAGAGAGAT
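
The eyeballed alignments above can be backed by a number. The sketch below computes the smallest edit distance between an extracted hit and any substring of the dimer (a semi-global alignment with unit costs). It is an illustration under assumptions: `best_match_distance` is a hypothetical helper, and the dimer string is simply the one from the comparison comments with the gap characters removed, not an independently verified adapter sequence.

```python
def best_match_distance(query: str, reference: str) -> int:
    """Smallest edit distance between query and any substring of reference."""
    # prev[j]: best cost of aligning the query prefix seen so far so that the
    # alignment ends at reference position j. An empty query costs nothing.
    prev = [0] * (len(reference) + 1)
    for i, q in enumerate(query, start=1):
        # Aligning i query bases against an empty reference prefix costs i.
        curr = [i] + [0] * len(reference)
        for j, r in enumerate(reference, start=1):
            curr[j] = min(
                prev[j - 1] + (q != r),  # match or mismatch
                prev[j] + 1,             # query base with no partner
                curr[j - 1] + 1,         # reference base skipped
            )
        prev = curr
    return min(prev)


# Dimer as written in the comparison comments, gaps removed.
smrtbell_dimer = (
    "TCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT"
    "ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT"
)
extracted = {
    "h2tg000001l": "ATCTCTCTCAAACAACAACAACGGGAGGAGAGGAAAAGAGAGAGATAAC",
    "h2tg000051l": "ATCTCTCTCAACAACATCAACGGAGGAGGAGGGAAAAGAGAGAGAT",
}
for name, seq in extracted.items():
    print(name, len(seq), best_match_distance(seq, smrtbell_dimer))
```

A distance that is small relative to the hit length supports treating the span as adapter; what counts as "small" is a judgment call.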

In [60]:
! mkdir mask

! echo 'lcl|HG002#2#h2tg000001l\t38657030\t38657078\tadaptor:NGB00972.1-not_cleaned' > mask/HG002_mat_mask.bed
! echo 'lcl|HG002#2#h2tg000051l\t72699959\t72700004\tadaptor:NGB00972.1-not_cleaned' >> mask/HG002_mat_mask.bed

mkdir: cannot create directory ‘mask’: File exists


In [6]:
! cat mask/HG002_mat_mask.bed

lcl|HG002#2#h2tg000001l	38657030	38657078	adaptor:NGB00972.1-not_cleaned
lcl|HG002#2#h2tg000051l	72699959	72700004	adaptor:NGB00972.1-not_cleaned
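
BED intervals are 0-based and half-open, whereas `samtools faidx` regions are 1-based and inclusive. The spans from the report were used unchanged as `samtools faidx` regions above, so they are assumed here to be 1-based inclusive. Under that assumption, a BED line that covers the whole span needs its start reduced by one. A minimal sketch of the conversion (`span_to_bed` is a hypothetical helper):

```python
from typing import Tuple


def span_to_bed(start_1based: int, end_1based: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive span to a 0-based half-open BED interval."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    return start_1based - 1, end_1based


# Spans as listed in the contamination report.
spans = {
    "lcl|HG002#2#h2tg000001l": (38657030, 38657078),
    "lcl|HG002#2#h2tg000051l": (72699959, 72700004),
}
total = 0
for name, (start, end) in spans.items():
    bed_start, bed_end = span_to_bed(start, end)
    total += bed_end - bed_start  # BED interval length is end - start
    print(f"{name}\t{bed_start}\t{bed_end}")
print("bases covered:", total)
```

The two converted intervals cover 49 + 46 = 95 bases, which matches the lengths of the two sequences pulled with `samtools faidx`.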


## Mask Adapters

**Install BedTools**

In [55]:
! wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools.static.binary
    
! mv bedtools.static.binary bedtools
! chmod a+x bedtools

--2021-05-27 22:50:22--  https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools.static.binary
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/15059334/c633cf80-61f8-11eb-92ef-18b90dff37e2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210527%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210527T225022Z&X-Amz-Expires=300&X-Amz-Signature=559be6e2cffed8c5d40d717177867c4ef83405fbb3ba618dd61d87d93786c4e4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15059334&response-content-disposition=attachment%3B%20filename%3Dbedtools.static.binary&response-content-type=application%2Foctet-stream [following]
--2021-05-27 22:50:22--  https://github-releases.githubusercontent.com/15059334/c633cf80-61f8-11eb-92ef-18b90dff37e2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJY

In [62]:
! ./bedtools maskfasta \
    -fi dropped/HG002.mat.dropped.fa \
    -bed mask/HG002_mat_mask.bed \
    -fo mask/HG002.mat.dropped.masked.fa
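
By default `bedtools maskfasta` hard-masks, replacing every base inside each BED interval with `N`. The following toy sketch shows that operation on a single sequence. `hard_mask` is a hypothetical helper for illustration only and is not how bedtools is implemented.

```python
from typing import Iterable, Tuple


def hard_mask(seq: str, intervals: Iterable[Tuple[int, int]]) -> str:
    """Replace bases in 0-based half-open intervals with 'N'."""
    masked = list(seq)
    for start, end in intervals:
        # Clamp to the sequence so out-of-range intervals are ignored.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


seq = "ACGTACGTAC"
print(hard_mask(seq, [(2, 5)]))  # indices 2, 3 and 4 become N
```

Because the interval is half-open, `(2, 5)` masks three bases, not four.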

## Check What We Did


In [13]:
! cat mask/HG002.mat.dropped.masked.fa | grep -v 'lcl|' | grep -o "N" | wc -l
! cat dropped/HG002.mat.dropped.fa | grep -v 'lcl|' | grep -o "N" | wc -l
! cat HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa | grep -v 'lcl|' | grep -o "N" | wc -l

419
326
326

**The masked file gains 419 - 326 = 93 Ns, but the two reported spans cover 49 + 46 = 95 bases.** BED intervals are 0-based and half-open, so using the report's start coordinates unchanged in the BED file leaves the first base of each span unmasked. Subtracting 1 from each BED start would mask the full spans.


**Take a look at the headers**

In [19]:
! cat mask/HG002.mat.dropped.masked.fa | grep '>' | tail

>lcl|HG002#2#h2tg000531l
>lcl|HG002#2#h2tg000532l
>lcl|HG002#2#h2tg000533l
>lcl|HG002#2#h2tg000536l
>lcl|HG002#2#h2tg000537l
>lcl|HG002#2#h2tg000538l
>lcl|HG002#2#h2tg000541c
>lcl|HG002#2#h2tg000542l
>lcl|HG002#2#h2tg000544l
>lcl|HG002#2#MT


In [17]:
! cat dropped/HG002.mat.dropped.fa | grep '>' | tail

>lcl|HG002#2#h2tg000531l
>lcl|HG002#2#h2tg000532l
>lcl|HG002#2#h2tg000533l
>lcl|HG002#2#h2tg000536l
>lcl|HG002#2#h2tg000537l
>lcl|HG002#2#h2tg000538l
>lcl|HG002#2#h2tg000541c
>lcl|HG002#2#h2tg000542l
>lcl|HG002#2#h2tg000544l
>lcl|HG002#2#MT


In [16]:
! cat HG002_maternal_f1_assembly_v2_1_genbank00000000.fsa | grep '>' | tail

>lcl|HG002#2#h2tg000532l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000533l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000535l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000536l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000537l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000538l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000541c Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000542l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000544l Homo sapiens isolate NA24385
>lcl|HG002#2#MT Homo sapiens isolate NA24385 mitochondrion


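**So the headers need to be fixed:** `samtools faidx` writes only the sequence name in its output headers, so the description text was lost when the contig was dropped.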

# Fix Headers
**Need to add "Homo sapiens isolate NA24385" to the end of all headers, and need to add "mitochondrion" to MT header.**
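
The header rule just described can be sketched as a per-line Python function. This is an illustrative alternative to a stream edit, under the assumption that the mitochondrial record is the one whose ID ends in `#MT`. `fix_header` is a hypothetical helper.

```python
def fix_header(line: str, sample: str = "Homo sapiens isolate NA24385") -> str:
    """Append the sample description (and 'mitochondrion' for MT) to a FASTA header."""
    if not line.startswith(">"):
        return line  # sequence lines pass through untouched
    header = line.rstrip("\n")
    fields = header[1:].split()
    seq_id = fields[0] if fields else ""
    header = f"{header} {sample}"
    if seq_id.endswith("#MT"):
        header += " mitochondrion"
    return header + "\n"


for example in (">lcl|HG002#2#h2tg000544l\n", ">lcl|HG002#2#MT\n", "ACGTMT\n"):
    print(fix_header(example), end="")
```

Checking `startswith(">")` first means a sequence line can never be modified, even if it happens to contain the letters `MT`.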

## Fix By Adding Sample Info And MT Signifier

In [28]:
! mkdir rename_headers

Add sample info

In [29]:
! sed '/>lcl/s/$/ Homo sapiens isolate NA24385/' \
    mask/HG002.mat.dropped.masked.fa \
    > rename_headers/HG002.mat.dropped.masked.renamed_pt1.fa

Add Mito ID

In [30]:
# Anchor on the MT header so a sequence line containing "MT" can never be edited
! sed '/^>.*#MT /s/$/ mitochondrion/' \
    rename_headers/HG002.mat.dropped.masked.renamed_pt1.fa \
    > rename_headers/HG002.mat.dropped.masked.renamed_pt2.fa

## Check Everything Again On Final File

In [31]:
! cat rename_headers/HG002.mat.dropped.masked.renamed_pt2.fa | grep ">" | tail

>lcl|HG002#2#h2tg000531l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000532l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000533l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000536l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000537l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000538l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000541c Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000542l Homo sapiens isolate NA24385
>lcl|HG002#2#h2tg000544l Homo sapiens isolate NA24385
>lcl|HG002#2#MT Homo sapiens isolate NA24385 mitochondrion


In [32]:
! cat rename_headers/HG002.mat.dropped.masked.renamed_pt2.fa | grep "Homo sapiens isolate NA24385" | wc -l

445


In [36]:
! cat rename_headers/HG002.mat.dropped.masked.renamed_pt2.fa | grep -v 'lcl|' | grep -o "N" | wc -l

419


In [37]:
! grep -v ">" rename_headers/HG002.mat.dropped.masked.renamed_pt2.fa | wc | awk '{print $3-$1}'

3.06061e+09


# Copy Final FASTA To Bucket

In [42]:
! mkdir final_fasta

## Copy Over Maternal (Changed) File

In [43]:
! gzip rename_headers/HG002.mat.dropped.masked.renamed_pt2.fa

In [50]:
! cp \
    rename_headers/HG002.mat.dropped.masked.renamed_pt2.fa.gz \
    final_fasta/HG002.mat.dropped.masked.renamed_pt2.fa.gz

In [63]:
! gsutil cp \
    final_fasta/HG002.mat.dropped.masked.renamed_pt2.fa.gz \
    {bucket}fix_hg002_genbank/HG002.mat.dropped.masked.renamed_pt2.fa.gz

Copying file://final_fasta/HG002.mat.dropped.masked.renamed_pt2.fa.gz [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run          
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

- [1 files][851.5 MiB/851.5 MiB]   84.5 MiB/s                                   
Operation completed over 1 objects/851.5 MiB.                                    


## Copy Over Paternal (Unchanged) File

In [52]:
! wget \
    https://submit.ncbi.nlm.nih.gov/api/2.0/files/hkf9hszi/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip/?format=attachment \
    --output-document final_fasta/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip

--2021-06-01 20:47:48--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/hkf9hszi/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 849361683 (810M) [application/zip]
Saving to: ‘final_fasta/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip’


2021-06-01 20:48:16 (30.1 MB/s) - ‘final_fasta/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip’ saved [849361683/849361683]



In [53]:
! unzip final_fasta/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip

Archive:  final_fasta/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip
  inflating: fasta/HG002_paternal_f1_assembly_v2_100000000.fsa  


In [56]:
! cp \
    fasta/HG002_paternal_f1_assembly_v2_100000000.fsa \
    final_fasta/HG002_paternal_f1_assembly_v2_100000000.fsa

! gzip final_fasta/HG002_paternal_f1_assembly_v2_100000000.fsa

In [60]:
! rm final_fasta/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip

In [62]:
! gsutil cp \
    final_fasta/HG002_paternal_f1_assembly_v2_100000000.fsa.gz \
    {bucket}fix_hg002_genbank/HG002_paternal_f1_assembly_v2_100000000.fsa.gz

Copying file://final_fasta/HG002_paternal_f1_assembly_v2_100000000.fsa.gz [Content-Type=application/octet-stream]...

| [1 files][810.0 MiB/810.0 MiB]                                                
Operation completed over 1 objects/810.0 MiB.                                    
