# Notebook 13: Summarize Genbank Contamination<a class="tocSkip">

**Downloading Genbank contamination to summarize so we can make sure that the changes all make sense**
    
    
**The steps that we will take are:**
1. Import Statements & Global Variable Definitions
2. TBD

# Import Statements & Global Variable Definitions

## Load Python packages
----

In [2]:
%pip install terra-pandas

Collecting terra-pandas
  Downloading terra-pandas-0.0.1.tar.gz (4.3 kB)
Collecting terra-notebook-utils<0.9.0,>=0.8.1
  Downloading terra-notebook-utils-0.8.1.tar.gz (32 kB)
Collecting google-cloud-storage==1.31.2
  Downloading google_cloud_storage-1.31.2-py2.py3-none-any.whl (88 kB)
[K     |████████████████████████████████| 88 kB 4.6 MB/s eta 0:00:011
[?25hCollecting gs-chunked-io<0.6,>=0.5.1
  Downloading gs-chunked-io-0.5.2.tar.gz (8.1 kB)
Collecting firecloud
  Downloading firecloud-0.16.31.tar.gz (53 kB)
[K     |████████████████████████████████| 53 kB 2.1 MB/s  eta 0:00:01
[?25hCollecting bgzip<0.4,>=0.3.5
  Downloading bgzip-0.3.5.tar.gz (61 kB)
[K     |████████████████████████████████| 61 kB 493 kB/s  eta 0:00:01
[?25hCollecting cli-builder<0.2,>=0.1.5
  Downloading cli-builder-0.1.5.tar.gz (3.5 kB)
Collecting oauth2client
  Downloading oauth2client-4.1.3-py2.py3-none-any.whl (98 kB)
[K     |████████████████████████████████| 98 kB 7.7 MB/s  eta 0:00:01
[?25hCollecting j

  Building wheel for firecloud (setup.py) ... [?25ldone
[?25h  Created wheel for firecloud: filename=firecloud-0.16.31-py3-none-any.whl size=53437 sha256=251ad7e379ed09deaa630c60e5f04e503b7f31777a4f0ce3a8beb2597de15fb4
  Stored in directory: /home/jupyter-user/.cache/pip/wheels/df/5d/2a/cd382b7648f96c90a2fd0114807d83697c9a6c217b0d07d9fe
  Building wheel for wrapt (setup.py) ... [?25ldone
[?25h  Created wheel for wrapt: filename=wrapt-1.12.1-cp37-cp37m-linux_x86_64.whl size=68701 sha256=4a06409f771b3fc1618bae546a3a17f2bdc3b2cbfc47d806b1a6136b93aad39d
  Stored in directory: /home/jupyter-user/.cache/pip/wheels/62/76/4c/aa25851149f3f6d9785f6c869387ad82b3fd37582fa8147ac6
Successfully built terra-pandas terra-notebook-utils bgzip cli-builder gs-chunked-io firecloud wrapt
Installing collected packages: six, pyasn1, urllib3, setuptools, rsa, pyparsing, pycparser, pyasn1-modules, protobuf, idna, chardet, certifi, cachetools, requests, pytz, packaging, googleapis-common-protos, google-auth,

In [1]:
%%capture 
import terra_notebook_utils as tnu
import terra_pandas as tp
import os
import io
import gzip
import pandas as pd
import numpy as np

## Set Environment Variables

In [2]:
# Get the Google billing project name and workspace name
PROJECT = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE = os.path.basename(os.path.dirname(os.getcwd()))
bucket = os.environ['WORKSPACE_BUCKET'] + "/"

# Verify that we've captured the environment variables
print("Billing project: " + PROJECT)
print("Workspace: " + WORKSPACE)
print("Workspace storage bucket: " + bucket)

Billing project: human-pangenome-ucsc
Workspace: HPRC_Reassembly
Workspace storage bucket: gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/


In [3]:
submissions = ["SUB9500620",
                "SUB9547080",
                "SUB9550744",
                "SUB9552074",
                "SUB9556721",
                "SUB9558008",
                "SUB9582250",
                "SUB9583198",
                "SUB9583305",
                "SUB9583513"]

# Function Definitions

In [4]:
def rtn_contam_contigs(contam_fp: str):
    
    start_strings  = ["Trim:"]
    header_strings = ["Sequence name, length, span(s), apparent source"]
    end_string     = ""

    ## Loop through file, and pull contaminations entries. (These are written
    ## in between the header_string and the end_string -- if there are any.)
    
    column_names = ["contig", "size", "start", "stop", "source"]
    contam_df = pd.DataFrame(columns = column_names)
    
    with open(contam_fp) as infile:
        copy = False
        found_hits = False
        
        for line in infile:
            if line.strip() in start_strings:
                copy = True
                continue
            elif line.strip() in header_strings:
                continue
            elif line.strip() == end_string:
                copy = False
                continue
            elif copy:
                contam_line = line.strip().split()
                
                contig = contam_line[0]
                size   = contam_line[1]
                
                ## NCBI puts all sites for one contig into a line
                positions = contam_line[2].split(",")
                starts  = [i.split('..')[0] for i in positions]
                stops   = [i.split('..')[1] for i in positions]
                
                source = ' '.join(contam_line[3:])

                for i in range(0, len(starts)):
                    contam_df = contam_df.append(pd.DataFrame({
                                                       "contig": contig,
                                                       "size": size, 
                                                       "start": starts[i],
                                                       "stop": stops[i],
                                                       "source": source
                                                      }, index=[0]),
                                                      ignore_index=True)
                
                
    return contam_df

In [5]:
def rtn_fixed_contam_contigs(contam_fp: str):
    
    start_strings  = ["Exclude:"]
    header_strings = ["Sequence name, length, apparent source"]
    end_string     = ""

    ## Loop through file, and pull fixed contamination entries. (These are written
    ## in between the header_string and the end_string -- if there are any.)
    
    column_names = ["contig", "size", "source"]
    contam_df = pd.DataFrame(columns = column_names)
    
    with open(contam_fp) as infile:
        copy = False
        found_hits = False
        
        for line in infile:
            if line.strip() in start_strings:
                copy = True
                continue
            elif line.strip() in header_strings:
                continue
            elif line.strip() == end_string:
                copy = False
                continue
            elif copy:
                contam_line = line.strip().split()
                
                contig = contam_line[0]
                size   = contam_line[1]        
                source = ' '.join(contam_line[2:])

                contam_df = contam_df.append(pd.DataFrame({
                                                       "contig": contig,
                                                       "size": size, 
                                                       "source": source
                                                      }, index=[0]),
                                                      ignore_index=True)
                
                
    return contam_df

# Download Files Which List All Info Files For Submissions



In [6]:
! mkdir Genbank
%cd Genbank

mkdir: cannot create directory ‘Genbank’: File exists
/home/jupyter-user/notebooks/HPRC_Reassembly/edit/Genbank


In [7]:
for submission in submissions:
    gs_path = f"gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/2021_05_28_Genbank_Contamination/Submission_files_to_download/{submission}_report_files.txt"
    ! gsutil cp {gs_path} ./

Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/2021_05_28_Genbank_Contamination/Submission_files_to_download/SUB9500620_report_files.txt...
/ [1 files][  4.9 KiB/  4.9 KiB]                                                
Operation completed over 1 objects/4.9 KiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/2021_05_28_Genbank_Contamination/Submission_files_to_download/SUB9547080_report_files.txt...
/ [1 files][  6.1 KiB/  6.1 KiB]                                                
Operation completed over 1 objects/6.1 KiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/2021_05_28_Genbank_Contamination/Submission_files_to_download/SUB9550744_report_files.txt...
/ [1 files][  6.1 KiB/  6.1 KiB]                                                
Operation completed over 1 objects/6.1 KiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/2021_05_28_Genbank_C

In [8]:
! mkdir report_files
! mv * report_files/

mkdir: cannot create directory ‘report_files’: File exists
mv: cannot move 'report_files' to a subdirectory of itself, 'report_files/report_files'


In [9]:
for submission in submissions:
    src_file = f"report_files/{submission}_report_files.txt"
    dst_file = f"report_files/{submission}_report_files_no_zip.txt"
    
    ! grep -v "zip" {src_file} > {dst_file}

In [10]:
%cd report_files

/home/jupyter-user/notebooks/HPRC_Reassembly/edit/Genbank/report_files


# Download Actual Report Files

In [11]:
! mkdir contam_reports

In [12]:
%cd contam_reports

/home/jupyter-user/notebooks/HPRC_Reassembly/edit/Genbank/report_files/contam_reports


In [13]:
for submission in submissions:
    download_fp = f"../{submission}_report_files_no_zip.txt"
    ! wget --content-disposition -i {download_fp}

--2021-06-23 15:16:40--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/qbmq2nhd/remainingcontamination_hg01891_maternal_f1_assembly_v2.txt/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4030 (3.9K) [text/plain]
Saving to: ‘RemainingContamination_HG01891_maternal_f1_assembly_v2.txt’


2021-06-23 15:16:41 (3.63 MB/s) - ‘RemainingContamination_HG01891_maternal_f1_assembly_v2.txt’ saved [4030/4030]

--2021-06-23 15:16:41--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/wqk0dsck/contamination_hg01891_maternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 5016 (4.9K) [text/plain]
Saving to: ‘Contamination_HG01891_maternal_f1_assembly_v2.txt’


2021-06-23 15:16:41 (

HTTP request sent, awaiting response... 200 OK
Length: 11685 (11K) [application/octet-stream]
Saving to: ‘JAGYVI01_accs’


2021-06-23 15:16:43 (11.1 MB/s) - ‘JAGYVI01_accs’ saved [11685/11685]

--2021-06-23 15:16:43--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/uilzwm96/remainingcontamination_hg02486_maternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 4103 (4.0K) [text/plain]
Saving to: ‘RemainingContamination_HG02486_maternal_f1_assembly_v2.txt’


2021-06-23 15:16:43 (77.4 MB/s) - ‘RemainingContamination_HG02486_maternal_f1_assembly_v2.txt’ saved [4103/4103]

--2021-06-23 15:16:43--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/t1sha3x3/contamination_hg02486_maternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 4016 (3.9K) [text/plain]
Saving to: ‘Contamination_HG024

HTTP request sent, awaiting response... 200 OK
Length: 9937 (9.7K) [application/octet-stream]
Saving to: ‘JAGYVK01_accs’


2021-06-23 15:16:44 (10.3 MB/s) - ‘JAGYVK01_accs’ saved [9937/9937]

FINISHED --2021-06-23 15:16:44--
Total wall clock time: 4.0s
Downloaded: 32 files, 205K in 0.05s (3.84 MB/s)
--2021-06-23 15:16:45--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/fste4hqo/remainingcontamination_hg01123_maternal_f1_assembly_v2_1.txt/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3934 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG01123_maternal_f1_assembly_v2_1.txt’


2021-06-23 15:16:45 (8.21 MB/s) - ‘RemainingContamination_HG01123_maternal_f1_assembly_v2_1.txt’ saved [3934/3934]

--2021-06-23 15:16:45--  https://submit.ncbi.nlm.nih.gov/api/2.0/fil

HTTP request sent, awaiting response... 200 OK
Length: 5966 (5.8K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG01258_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:46 (5.34 MB/s) - ‘FixedForeignContaminations_HG01258_paternal_f1_assembly_v2.txt’ saved [5966/5966]

--2021-06-23 15:16:46--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/istk9cu4/jagyyv01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 12369 (12K) [application/octet-stream]
Saving to: ‘JAGYYV01_accs’


2021-06-23 15:16:46 (9.58 MB/s) - ‘JAGYYV01_accs’ saved [12369/12369]

--2021-06-23 15:16:46--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/xvsr1rwj/remainingcontamination_hg01358_maternal_f1_assembly_v2_1.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 4168 (4.1K) [text/plain]
Saving to: ‘RemainingContamination_HG01358_maternal_f1_as


2021-06-23 15:16:47 (11.4 MB/s) - ‘Contamination_HG01361_paternal_f1_assembly_v2.txt’ saved [8624/8624]

--2021-06-23 15:16:47--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/iodveliw/fixedforeigncontaminations_hg01361_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 6600 (6.4K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG01361_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:47 (10.5 MB/s) - ‘FixedForeignContaminations_HG01361_paternal_f1_assembly_v2.txt’ saved [6600/6600]

--2021-06-23 15:16:47--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/uevtnlzt/jagyyx01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 11001 (11K) [application/octet-stream]
Saving to: ‘JAGYYX01_accs’


2021-06-23 15:16:47 (11.5 MB/s) - ‘JAGYYX01_accs’ saved [11001/11001]

--2021-06-23 15:16:47--  http

HTTP request sent, awaiting response... 200 OK
Length: 5353 (5.2K) [text/plain]
Saving to: ‘Contamination_HG02572_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:50 (158 MB/s) - ‘Contamination_HG02572_paternal_f1_assembly_v2.txt’ saved [5353/5353]

--2021-06-23 15:16:50--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/5vpjadq7/fixedforeigncontaminations_hg02572_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3155 (3.1K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG02572_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:50 (2.31 MB/s) - ‘FixedForeignContaminations_HG02572_paternal_f1_assembly_v2.txt’ saved [3155/3155]

--2021-06-23 15:16:50--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/olvewjoi/jahaow01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 24415 (24K) [application/

HTTP request sent, awaiting response... 200 OK
Length: 8521 (8.3K) [text/plain]
Saving to: ‘Contamination_HG02630_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:51 (24.0 MB/s) - ‘Contamination_HG02630_paternal_f1_assembly_v2.txt’ saved [8521/8521]

--2021-06-23 15:16:51--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/dvpu0yhp/fixedforeigncontaminations_hg02630_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 6494 (6.3K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG02630_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:51 (8.64 MB/s) - ‘FixedForeignContaminations_HG02630_paternal_f1_assembly_v2.txt’ saved [6494/6494]

--2021-06-23 15:16:51--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/nhri2cqi/jahaoq01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 20349 (20K) [application

HTTP request sent, awaiting response... 200 OK
Length: 7700 (7.5K) [text/plain]
Saving to: ‘Contamination_HG02886_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:52 (8.51 MB/s) - ‘Contamination_HG02886_paternal_f1_assembly_v2.txt’ saved [7700/7700]

--2021-06-23 15:16:52--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/hpcv2wo0/fixedforeigncontaminations_hg02886_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 5594 (5.5K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG02886_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:53 (37.8 MB/s) - ‘FixedForeignContaminations_HG02886_paternal_f1_assembly_v2.txt’ saved [5594/5594]

--2021-06-23 15:16:53--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/pywsecma/jahaou01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 19171 (19K) [application

HTTP request sent, awaiting response... 200 OK
Length: 3943 (3.9K) [text/plain]
Saving to: ‘RemainingContamination_HG01978_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:54 (138 MB/s) - ‘RemainingContamination_HG01978_paternal_f1_assembly_v2.txt’ saved [3943/3943]

--2021-06-23 15:16:54--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/vvnqva3c/contamination_hg01978_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 5133 (5.0K) [text/plain]
Saving to: ‘Contamination_HG01978_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:55 (139 MB/s) - ‘Contamination_HG01978_paternal_f1_assembly_v2.txt’ saved [5133/5133]

--2021-06-23 15:16:55--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/jemhxicj/fixedforeigncontaminations_hg01978_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length

HTTP request sent, awaiting response... 200 OK
Length: 3870 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG03540_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:56 (40.9 MB/s) - ‘RemainingContamination_HG03540_paternal_f1_assembly_v2.txt’ saved [3870/3870]

--2021-06-23 15:16:56--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/h5jzziza/contamination_hg03540_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 6874 (6.7K) [text/plain]
Saving to: ‘Contamination_HG03540_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:56 (62.6 MB/s) - ‘Contamination_HG03540_paternal_f1_assembly_v2.txt’ saved [6874/6874]

--2021-06-23 15:16:56--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/bhwd7ebp/fixedforeigncontaminations_hg03540_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Leng

HTTP request sent, awaiting response... 200 OK
Length: 11334 (11K) [application/octet-stream]
Saving to: ‘JAHALX01_accs’


2021-06-23 15:16:58 (18.9 MB/s) - ‘JAHALX01_accs’ saved [11334/11334]

--2021-06-23 15:16:58--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/rdezojpi/remainingcontamination_hg00741_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3869 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG00741_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:58 (63.7 MB/s) - ‘RemainingContamination_HG00741_paternal_f1_assembly_v2.txt’ saved [3869/3869]

--2021-06-23 15:16:58--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/248u7lux/contamination_hg00741_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 5019 (4.9K) [text/plain]
Saving to: ‘Contamination_HG007

HTTP request sent, awaiting response... 200 OK
Length: 12018 (12K) [application/octet-stream]
Saving to: ‘JAHALZ01_accs’


2021-06-23 15:16:59 (100 MB/s) - ‘JAHALZ01_accs’ saved [12018/12018]

--2021-06-23 15:16:59--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/ct7zhz8v/remainingcontamination_hg01175_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 4081 (4.0K) [text/plain]
Saving to: ‘RemainingContamination_HG01175_paternal_f1_assembly_v2.txt’


2021-06-23 15:16:59 (8.41 MB/s) - ‘RemainingContamination_HG01175_paternal_f1_assembly_v2.txt’ saved [4081/4081]

--2021-06-23 15:16:59--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/rmjgoxvg/contamination_hg01175_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 5619 (5.5K) [text/plain]
Saving to: ‘Contamination_HG0117

HTTP request sent, awaiting response... 200 OK
Length: 16122 (16K) [application/octet-stream]
Saving to: ‘JAHAMF01_accs’


2021-06-23 15:17:00 (11.8 MB/s) - ‘JAHAMF01_accs’ saved [16122/16122]

--2021-06-23 15:17:00--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/pusqrzp4/remainingcontamination_hg02148_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3867 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG02148_paternal_f1_assembly_v2.txt’


2021-06-23 15:17:00 (135 MB/s) - ‘RemainingContamination_HG02148_paternal_f1_assembly_v2.txt’ saved [3867/3867]

--2021-06-23 15:17:00--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/xryzxvjk/contamination_hg02148_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 4167 (4.1K) [text/plain]
Saving to: ‘Contamination_HG0214

HTTP request sent, awaiting response... 200 OK
Length: 1883 (1.8K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG00621_maternal_f1_assembly_v2.txt’


2021-06-23 15:17:03 (1.57 MB/s) - ‘FixedForeignContaminations_HG00621_maternal_f1_assembly_v2.txt’ saved [1883/1883]

--2021-06-23 15:17:03--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/wwsdcjez/jahbcc01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 9662 (9.4K) [application/octet-stream]
Saving to: ‘JAHBCC01_accs’


2021-06-23 15:17:03 (20.0 MB/s) - ‘JAHBCC01_accs’ saved [9662/9662]

--2021-06-23 15:17:03--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/yakreslp/remainingcontamination_hg00621_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3868 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG00621_paternal_f1_assemb

HTTP request sent, awaiting response... 200 OK
Length: 2201 (2.1K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG00735_maternal_f1_assembly_v2.txt’


2021-06-23 15:17:04 (2.83 MB/s) - ‘FixedForeignContaminations_HG00735_maternal_f1_assembly_v2.txt’ saved [2201/2201]

--2021-06-23 15:17:04--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/6vtaau2e/jahbcg01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 9320 (9.1K) [application/octet-stream]
Saving to: ‘JAHBCG01_accs’


2021-06-23 15:17:04 (12.0 MB/s) - ‘JAHBCG01_accs’ saved [9320/9320]

--2021-06-23 15:17:04--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/kfubvo2u/remainingcontamination_hg00735_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 4163 (4.1K) [text/plain]
Saving to: ‘RemainingContamination_HG00735_paternal_f1_assemb

HTTP request sent, awaiting response... 200 OK
Length: 5901 (5.8K) [text/plain]
Saving to: ‘Contamination_NA19240_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:06 (20.9 MB/s) - ‘Contamination_NA19240_maternal_f1_assembly_v2_genbank.txt’ saved [5901/5901]

--2021-06-23 15:17:06--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/kncnpqir/fixedforeigncontaminations_na19240_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3693 (3.6K) [text/plain]
Saving to: ‘FixedForeignContaminations_NA19240_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:07 (68.9 MB/s) - ‘FixedForeignContaminations_NA19240_maternal_f1_assembly_v2_genbank.txt’ saved [3693/3693]

--2021-06-23 15:17:07--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/igzikfiz/jaheol01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response...

HTTP request sent, awaiting response... 200 OK
Length: 4178 (4.1K) [text/plain]
Saving to: ‘Contamination_HG03486_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:08 (117 MB/s) - ‘Contamination_HG03486_maternal_f1_assembly_v2_genbank.txt’ saved [4178/4178]

--2021-06-23 15:17:08--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/w2kxizfu/fixedforeigncontaminations_hg03486_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 2103 (2.1K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG03486_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:08 (3.93 MB/s) - ‘FixedForeignContaminations_HG03486_maternal_f1_assembly_v2_genbank.txt’ saved [2103/2103]

--2021-06-23 15:17:08--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/0nrnjsmp/jaheop01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 

HTTP request sent, awaiting response... 200 OK
Length: 4052 (4.0K) [text/plain]
Saving to: ‘Contamination_HG02723_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:09 (134 MB/s) - ‘Contamination_HG02723_maternal_f1_assembly_v2_genbank.txt’ saved [4052/4052]

--2021-06-23 15:17:09--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/idslaqru/fixedforeigncontaminations_hg02723_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 2050 (2.0K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG02723_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:09 (2.73 MB/s) - ‘FixedForeignContaminations_HG02723_maternal_f1_assembly_v2_genbank.txt’ saved [2050/2050]

--2021-06-23 15:17:09--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/xwivsljw/jaheot01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 

HTTP request sent, awaiting response... 200 OK
Length: 3946 (3.9K) [text/plain]
Saving to: ‘RemainingContamination_HG01109_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:11 (4.14 MB/s) - ‘RemainingContamination_HG01109_maternal_f1_assembly_v2_genbank.txt’ saved [3946/3946]

--2021-06-23 15:17:11--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/wale2a4s/contamination_hg01109_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 5030 (4.9K) [text/plain]
Saving to: ‘Contamination_HG01109_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:11 (5.23 MB/s) - ‘Contamination_HG01109_maternal_f1_assembly_v2_genbank.txt’ saved [5030/5030]

--2021-06-23 15:17:11--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/mvdqchxn/fixedforeigncontaminations_hg01109_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTT

HTTP request sent, awaiting response... 200 OK
Length: 18335 (18K) [application/octet-stream]
Saving to: ‘JAHEPE01_accs’


2021-06-23 15:17:12 (32.8 MB/s) - ‘JAHEPE01_accs’ saved [18335/18335]

--2021-06-23 15:17:12--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/fsgamyqr/remainingcontamination_hg01243_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3871 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG01243_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:13 (17.0 MB/s) - ‘RemainingContamination_HG01243_maternal_f1_assembly_v2_genbank.txt’ saved [3871/3871]

--2021-06-23 15:17:13--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/afajhggc/contamination_hg01243_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3903 (3.8K) [text/plain]

HTTP request sent, awaiting response... 200 OK
Length: 18487 (18K) [application/octet-stream]
Saving to: ‘JAHEOW01_accs’


2021-06-23 15:17:14 (12.2 MB/s) - ‘JAHEOW01_accs’ saved [18487/18487]

FINISHED --2021-06-23 15:17:14--
Total wall clock time: 3.5s
Downloaded: 40 files, 284K in 0.06s (4.82 MB/s)
--2021-06-23 15:17:15--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/cfcxdquz/remainingcontamination_hg02055_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4087 (4.0K) [text/plain]
Saving to: ‘RemainingContamination_HG02055_maternal_f1_assembly_v2_genbank.txt’


2021-06-23 15:17:15 (10.5 MB/s) - ‘RemainingContamination_HG02055_maternal_f1_assembly_v2_genbank.txt’ saved [4087/4087]

--2021-06-23 15:17:15--  https://submit.ncbi.nlm

HTTP request sent, awaiting response... 200 OK
Length: 2837 (2.8K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG03098_paternal_f1_assembly_v2.txt’


2021-06-23 15:17:16 (950 KB/s) - ‘FixedForeignContaminations_HG03098_paternal_f1_assembly_v2.txt’ saved [2837/2837]

--2021-06-23 15:17:16--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/sg1he8lr/jahepm01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 16967 (17K) [application/octet-stream]
Saving to: ‘JAHEPM01_accs’


2021-06-23 15:17:17 (27.6 MB/s) - ‘JAHEPM01_accs’ saved [16967/16967]

--2021-06-23 15:17:17--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/w0hyim6j/remainingcontamination_hg03492_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3869 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG03492_maternal_

HTTP request sent, awaiting response... 200 OK
Length: 4745 (4.6K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG02109_paternal_f1_assembly_v2.txt’


2021-06-23 15:17:18 (5.17 MB/s) - ‘FixedForeignContaminations_HG02109_paternal_f1_assembly_v2.txt’ saved [4745/4745]

--2021-06-23 15:17:18--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/fc7l6rdz/jahepg01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 18335 (18K) [application/octet-stream]
Saving to: ‘JAHEPG01_accs’


2021-06-23 15:17:18 (24.0 MB/s) - ‘JAHEPG01_accs’ saved [18335/18335]

--2021-06-23 15:17:18--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/5tl9ggn0/remainingcontamination_hg02145_maternal_f1_assembly_v2_genbank.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 3935 (3.8K) [text/plain]
Saving to: ‘RemainingContamination_HG02145_maternal

HTTP request sent, awaiting response... 200 OK
Length: 7293 (7.1K) [text/plain]
Saving to: ‘Contamination_HG00733_paternal_f1_assembly_v2.txt’


2021-06-23 15:17:20 (8.73 MB/s) - ‘Contamination_HG00733_paternal_f1_assembly_v2.txt’ saved [7293/7293]

--2021-06-23 15:17:20--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/tb5fcjbq/fixedforeigncontaminations_hg00733_paternal_f1_assembly_v2.txt/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 5010 (4.9K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG00733_paternal_f1_assembly_v2.txt’


2021-06-23 15:17:20 (4.65 MB/s) - ‘FixedForeignContaminations_HG00733_paternal_f1_assembly_v2.txt’ saved [5010/5010]

--2021-06-23 15:17:20--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/2bfoxhan/jahepq01_accs/?format=attachment
Reusing existing connection to submit.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 26467 (26K) [application

# Summarize Remaining Contamination

In [14]:
! ls -lh RemainingContamination_NA18906_maternal_f1_assembly_v2_genbank.txt
! ls -lh RemainingContamination_HG03579_paternal_f1_assembly_v2.txt

-rw-rw-r-- 1 jupyter-user users 3.8K May 14 15:34 RemainingContamination_NA18906_maternal_f1_assembly_v2_genbank.txt
-rw-rw-r-- 1 jupyter-user users 3.9K May  4 20:22 RemainingContamination_HG03579_paternal_f1_assembly_v2.txt


In [15]:
contam_files = ! ls RemainingContamination_*.txt

In [16]:
all_samples_df = pd.DataFrame()

for contam_file in contam_files:
    contam_contig_df = rtn_contam_contigs(contam_file)
    
    if contam_contig_df.empty:
       print(f"{contam_file} has no contamination")
    elif all_samples_df.empty:
        all_samples_df = contam_contig_df
    else:
        all_samples_df = all_samples_df.append(contam_contig_df, ignore_index=True)

In [17]:
print(all_samples_df.to_string())

                    contig       size      start       stop                     source
0    HG00438#2#h2tg000011l  101357667   83042823   83042954  mitochondrion-not_cleaned
1    HG00438#2#h2tg000045l   29542437   15133325   15133619  mitochondrion-not_cleaned
2    HG00438#2#h2tg000071l   27579532   22670213   22670468  mitochondrion-not_cleaned
3    HG00438#1#h1tg000005l   39720081   20281640   20281765  mitochondrion-not_cleaned
4    HG00438#1#h1tg000019l  101367096   83037330   83037461  mitochondrion-not_cleaned
5    HG00438#1#h1tg000068l   27582614    4914909    4915164  mitochondrion-not_cleaned
6    HG00438#1#h1tg000075l   24935429   21703895   21704057  mitochondrion-not_cleaned
7    HG00438#1#h1tg000129l    5131911    1530298    1530457  mitochondrion-not_cleaned
8    HG00438#1#h1tg000148l     390313     318800     324641  mitochondrion-not_cleaned
9      HG005#2#h2tg000012l   96890240    4913629    4913884  mitochondrion-not_cleaned
10     HG005#2#h2tg000039l  108226391   830

# Summarize Fixed Contamination

In [18]:
! ls -lh FixedForeignContaminations_HG002_maternal_f1_assembly_v2_1_genbank.txt
! ls -lh FixedForeignContaminations_HG02109_paternal_f1_assembly_v2.txt

ls: cannot access 'FixedForeignContaminations_HG002_maternal_f1_assembly_v2_1_genbank.txt': No such file or directory
-rw-rw-r-- 1 jupyter-user users 4.7K May 18 20:51 FixedForeignContaminations_HG02109_paternal_f1_assembly_v2.txt


**HG002 files are missing, pull them manually**

In [50]:
! wget --content-disposition https://submit.ncbi.nlm.nih.gov/api/2.0/files/hdc3gyhd/fixedforeigncontaminations_hg002_maternal_f1_assembly_v2_1_genbank.txt/?format=attachment

--2021-06-23 16:08:54--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/hdc3gyhd/fixedforeigncontaminations_hg002_maternal_f1_assembly_v2_1_genbank.txt/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3878 (3.8K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG002_maternal_f1_assembly_v2_1_genbank.txt’


2021-06-23 16:08:54 (114 MB/s) - ‘FixedForeignContaminations_HG002_maternal_f1_assembly_v2_1_genbank.txt’ saved [3878/3878]



In [51]:
! wget --content-disposition https://submit.ncbi.nlm.nih.gov/api/2.0/files/itejzjlc/fixedforeigncontaminations_hg002_paternal_f1_assembly_v2_1.txt/?format=attachment

--2021-06-23 16:09:32--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/itejzjlc/fixedforeigncontaminations_hg002_paternal_f1_assembly_v2_1.txt/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4233 (4.1K) [text/plain]
Saving to: ‘FixedForeignContaminations_HG002_paternal_f1_assembly_v2_1.txt’


2021-06-23 16:09:33 (3.56 MB/s) - ‘FixedForeignContaminations_HG002_paternal_f1_assembly_v2_1.txt’ saved [4233/4233]



In [52]:
fixed_contam_files = ! ls FixedForeignContaminations_*.txt

In [53]:
len(fixed_contam_files)

94

In [54]:
all_samples_fixed_df = pd.DataFrame()

for fix_contam_file in fixed_contam_files:
    fix_contam_contig_df = rtn_fixed_contam_contigs(fix_contam_file)
    
    if fix_contam_contig_df.empty:
       print(f"{fix_contam_file} had no contamination fixed")
    elif all_samples_fixed_df.empty:
        all_samples_fixed_df = fix_contam_contig_df
    else:
        all_samples_fixed_df = all_samples_fixed_df.append(fix_contam_contig_df, ignore_index=True)

In [55]:
print(all_samples_fixed_df.to_string())

                     contig    size                          source
0       HG002#2#h2tg000012l   20590        Human gammaherpesvirus 4
1       HG002#2#h2tg000014l   35061        Human gammaherpesvirus 4
2       HG002#2#h2tg000066l   22155        Human gammaherpesvirus 4
3       HG002#2#h2tg000095l   33061        Human gammaherpesvirus 4
4       HG002#2#h2tg000101l   34529        Human gammaherpesvirus 4
5       HG002#2#h2tg000106l   22221        Human gammaherpesvirus 4
6       HG002#2#h2tg000123l   31974        Human gammaherpesvirus 4
7       HG002#2#h2tg000126l   36878        Human gammaherpesvirus 4
8       HG002#2#h2tg000136l   24689        Human gammaherpesvirus 4
9       HG002#2#h2tg000144l   35307        Human gammaherpesvirus 4
10      HG002#2#h2tg000187l   26845        Human gammaherpesvirus 4
11      HG002#2#h2tg000196l   30878        Human gammaherpesvirus 4
12      HG002#2#h2tg000220l   29705        Human gammaherpesvirus 4
13      HG002#2#h2tg000228l   48383        Human

In [33]:
all_samples_fixed_df.rename(columns={'contig': 'dropped'}, inplace=True)

# Pull Accession Mappings

In [37]:
! ls -lh JAGYVH01_accs

-rw-rw-r-- 1 jupyter-user users 11K May 24 16:45 JAGYVH01_accs


In [38]:
accession_files_ls = ! ls *_accs

In [39]:
len(accession_files_ls)

94

In [47]:
all_samples_accessions_df = pd.DataFrame()

for accession_file in accession_files_ls:
    accession_df = pd.read_csv(accession_file, sep="\t", skiprows=2, header=None)
    all_samples_accessions_df = all_samples_accessions_df.append(accession_df, ignore_index=True)

In [48]:
print(all_samples_accessions_df.to_string())

                             0                1
0        HG02257#2#h2tg000001l  JAGYVH010000001
1        HG02257#2#h2tg000002l  JAGYVH010000002
2        HG02257#2#h2tg000003l  JAGYVH010000003
3        HG02257#2#h2tg000004l  JAGYVH010000004
4        HG02257#2#h2tg000005l  JAGYVH010000005
5        HG02257#2#h2tg000006l  JAGYVH010000006
6        HG02257#2#h2tg000007l  JAGYVH010000007
7        HG02257#2#h2tg000008l  JAGYVH010000008
8        HG02257#2#h2tg000009l  JAGYVH010000009
9        HG02257#2#h2tg000010l  JAGYVH010000010
10       HG02257#2#h2tg000011l  JAGYVH010000011
11       HG02257#2#h2tg000012l  JAGYVH010000012
12       HG02257#2#h2tg000013l  JAGYVH010000013
13       HG02257#2#h2tg000014l  JAGYVH010000014
14       HG02257#2#h2tg000015l  JAGYVH010000015
15       HG02257#2#h2tg000016l  JAGYVH010000016
16       HG02257#2#h2tg000017l  JAGYVH010000017
17       HG02257#2#h2tg000018l  JAGYVH010000018
18       HG02257#2#h2tg000019l  JAGYVH010000019
19       HG02257#2#h2tg000020l  JAGYVH01

# Compare Unaligned Contigs (Mobin) To Dropped Contigs (NCBI)

## Read In Alignment Table

In [91]:
aligned_sample_df = tp.table_to_dataframe("align_sample")

aligned_sample_df = aligned_sample_df.drop(index='HG002_downsampled')

aligned_sample_df.head()

Unnamed: 0_level_0,pat_fasta,pat_unmapped_names,pat_unmapped_fasta,mat_unmapped_fasta,mat_unmapped_names,patAssemblyChm13WinnowmapBam,matAssemblyChm13WinnowmapBam,mat_fasta
align_sample_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
HG002_full,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/8...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG00438,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/d...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/7...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG005,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/8...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/6...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG00621,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/d...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/7...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG00673,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/8...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/6...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...


**Loop Through Table And Write Unmapped Contigs**

In [92]:
unmapped_contigs_df = pd.DataFrame()

for index, row in aligned_sample_df.iterrows():
    
    sample_name        = row.name
    
    pat_unmapped_fp = row.pat_unmapped_names
    mat_unmapped_fp = row.mat_unmapped_names
    
    pat_unmapped_df = pd.read_csv(pat_unmapped_fp, header=None)
    mat_unmapped_df = pd.read_csv(mat_unmapped_fp, header=None)

    
    unmapped_contigs_df = unmapped_contigs_df.append(pat_unmapped_df, ignore_index=True)
    unmapped_contigs_df = unmapped_contigs_df.append(mat_unmapped_df, ignore_index=True)

In [93]:
unmapped_contigs_df.columns = ["unmapped"]

## Merge Dropped And Unmapped Contigs

In [100]:
combined_df = all_samples_fixed_df.merge(unmapped_contigs_df, left_on='dropped', right_on='unmapped', how='outer')

## Take a look at contigs that were dropped, but were mapped

In [102]:
dropped_only_df  = combined_df[combined_df['unmapped'].isna()]

In [103]:
len(dropped_only_df["dropped"])

1

In [104]:
dropped_only_df

Unnamed: 0,dropped,size,source,unmapped
1804,HG02559#1#h1tg000336l,27074,Human gammaherpesvirus 4,


## Take a look at contigs that were unmapped, but not dropped

In [105]:
unmapped_only_df = combined_df[combined_df['dropped'].isna()]

In [106]:
len(unmapped_only_df["unmapped"])

364

In [107]:
print(unmapped_only_df.to_string())

     dropped size source               unmapped
2957     NaN  NaN    NaN    HG002#1#h1tg000048l
2958     NaN  NaN    NaN    HG002#1#h1tg000738c
2959     NaN  NaN    NaN    HG002#2#h2tg000535l
2960     NaN  NaN    NaN    HG002#2#h2tg000266l
2961     NaN  NaN    NaN    HG002#2#h2tg000138l
2962     NaN  NaN    NaN    HG002#2#h2tg000541c
2963     NaN  NaN    NaN    HG005#1#h1tg000478l
2964     NaN  NaN    NaN    HG005#1#h1tg000517l
2965     NaN  NaN    NaN    HG005#2#h2tg000437l
2966     NaN  NaN    NaN    HG005#2#h2tg000688l
2967     NaN  NaN    NaN  HG00733#1#h1tg000623l
2968     NaN  NaN    NaN  HG00733#1#h1tg000595l
2969     NaN  NaN    NaN  HG01109#1#h1tg000358l
2970     NaN  NaN    NaN  HG01109#1#h1tg000346l
2971     NaN  NaN    NaN  HG01109#1#h1tg000497l
2972     NaN  NaN    NaN  HG01109#1#h1tg000438l
2973     NaN  NaN    NaN  HG01109#2#h2tg000231l
2974     NaN  NaN    NaN  HG01109#2#h2tg000357l
2975     NaN  NaN    NaN  HG01109#2#h2tg000375l
2976     NaN  NaN    NaN  HG01123#1#h1tg

# Pull Contigs That Don't Align & Were Not Dropped
## Rename HG002 to allow pulling fasta

In [112]:
aligned_sample_df.rename(index={"HG002_full":'HG002'}, inplace=True)

In [113]:
aligned_sample_df.head()

Unnamed: 0_level_0,pat_fasta,pat_unmapped_names,pat_unmapped_fasta,mat_unmapped_fasta,mat_unmapped_names,patAssemblyChm13WinnowmapBam,matAssemblyChm13WinnowmapBam,mat_fasta
align_sample_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
HG002,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/8...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG00438,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/d...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/7...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG005,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/8...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/6...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG00621,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/d...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/7...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...
HG00673,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/8...,gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/6...,gs://fc-4310e737-a388-4a10-8c9e-babe06aaf0cf/w...


## Pull Sample and Haplotype From unmapped_only_df

In [114]:
split_string = unmapped_only_df['unmapped'].str.split("#", expand = True)

In [120]:
unmapped_only_df['sample']    = split_string[0]
unmapped_only_df['haplotype'] = split_string[1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [140]:
print(unmapped_only_df)

     dropped size source               unmapped   sample haplotype
2957     NaN  NaN    NaN    HG002#1#h1tg000048l    HG002         1
2958     NaN  NaN    NaN    HG002#1#h1tg000738c    HG002         1
2959     NaN  NaN    NaN    HG002#2#h2tg000535l    HG002         2
2960     NaN  NaN    NaN    HG002#2#h2tg000266l    HG002         2
2961     NaN  NaN    NaN    HG002#2#h2tg000138l    HG002         2
...      ...  ...    ...                    ...      ...       ...
3316     NaN  NaN    NaN  HG03540#1#h1tg000225l  HG03540         1
3317     NaN  NaN    NaN  HG03579#1#h1tg000353l  HG03579         1
3318     NaN  NaN    NaN  HG03579#1#h1tg000362l  HG03579         1
3319     NaN  NaN    NaN  NA19240#1#h1tg000413l  NA19240         1
3320     NaN  NaN    NaN  NA19240#1#h1tg000498l  NA19240         1

[364 rows x 6 columns]


## Write Unmapped Contigs To Files

In [141]:
! mkdir fastas
%cd fastas

!mkdir unmapped_fastas

/home/jupyter-user/notebooks/HPRC_Reassembly/edit/Genbank/report_files/contam_reports/fastas


In [154]:
for index, row in unmapped_only_df.iterrows():
    
    sample    = row['sample']
    haplotype = row['haplotype']
    contig    = row['unmapped']
    
    is_aligned_sample_row = aligned_sample_df.index == sample
    
    if int(haplotype) == 1:
        fasta_fp = aligned_sample_df.loc[is_aligned_sample_row, 'pat_unmapped_fasta'].values[0]
    else:
        fasta_fp = aligned_sample_df.loc[is_aligned_sample_row, 'mat_unmapped_fasta'].values[0]
        
    ! gsutil cp {fasta_fp} ./
    
    unmapped_fn = os.path.basename(fasta_fp)
    
    ! samtools faidx {unmapped_fn} {contig} > unmapped_fastas/"{contig}.fa"

Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/6780b30f-78ba-45ad-937e-602c4178a14c/call-writeUnmapped/HG002.paternal.f1_assembly_v2.1.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][  1.4 MiB/  1.4 MiB]                                                
Operation completed over 1 objects/1.4 MiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/6780b30f-78ba-45ad-937e-602c4178a14c/call-writeUnmapped/HG002.paternal.f1_assembly_v2.1.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][  1.4 MiB/  1.4 MiB]                                                
Operation completed over 1 objects/1.4 MiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/9b2a58c0-0c29-4b98-8b11-d96d1c8b90be/writeUnmappedReads/79618723-b464-4c9d-95ad-732859c19ab2/call-writeUnmapped/H

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
- [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/d268cdc9-1f7d-4a40-9446-4acc4f431d72/call-writeUnmapped/HG02145.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Cop

/ [1 files][ 16.8 MiB/ 16.8 MiB]                                                
Operation completed over 1 objects/16.8 MiB.                                     
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/7047b757-ce90-42bc-9620-2693555f5b55/call-writeUnmapped/HG02559.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][  3.9 MiB/  3.9 MiB]                                                
Operation completed over 1 objects/3.9 MiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/c5be77a0-0e8e-40c2-9840-f2249d1a8d18/call-writeUnmapped/HG02572.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][  1.1 MiB/  1.1 MiB]                                                
Operation completed over 1 objects/1.1 MiB.                                      
Cop

/ [1 files][  2.4 MiB/  2.4 MiB]                                                
Operation completed over 1 objects/2.4 MiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/73bff718-ef5d-435c-b2bf-1bac08aae757/call-writeUnmapped/HG03579.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][  2.5 MiB/  2.5 MiB]                                                
Operation completed over 1 objects/2.5 MiB.                                      
Copying gs://fc-0c2122a8-6725-4199-b90e-828ab006078f/a3198dec-b3bd-4e3e-adc8-2eea43587c85/writeUnmappedReads/73bff718-ef5d-435c-b2bf-1bac08aae757/call-writeUnmapped/HG03579.paternal.f1_assembly_v2.chm13_v1.0_plusY.Winnowmap.sorted_unmapped_contigs.fa...
/ [1 files][  2.5 MiB/  2.5 MiB]                                                
Operation completed over 1 objects/2.5 MiB.                                      
Cop

In [157]:
! ls unmapped_fastas | wc -l

364


In [158]:
! gsutil cp -r unmapped_fastas {bucket}EBV_check/unmapped_fastas/

Copying file://unmapped_fastas/HG02145#1#h1tg000757l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000703l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000965l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000806l.fa [Content-Type=application/octet-stream]...
/ [4 files][141.4 KiB/141.4 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file://unmapped_fastas/HG02145#1#h1tg000995l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000864l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02630#1#h1tg000447l.fa [Content-Type=app

Copying file://unmapped_fastas/HG02145#1#h1tg000557l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000656l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000746l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000823l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000554l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000689l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000467l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000687l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG03492#1#h1tg000538l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000494l.fa [Content-Type=application/octet-stream]...
Copying fi

Copying file://unmapped_fastas/HG02145#1#h1tg000836l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000583l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000978l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000642l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG03098#1#h1tg000308l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000463l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg001056l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000537l.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG002#1#h1tg000738c.fa [Content-Type=application/octet-stream]...
Copying file://unmapped_fastas/HG02145#1#h1tg000877l.fa [Content-Type=application/octet-stream]...
Copying file

In [161]:
%cd ..

/home/jupyter-user/notebooks/HPRC_Reassembly/edit/Genbank/report_files


In [164]:
! cat SUB9583513_report_files.txt | grep hg002

https://submit.ncbi.nlm.nih.gov/api/2.0/files/8jtviypy/foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip/?format=attachment
https://submit.ncbi.nlm.nih.gov/api/2.0/files/uuwykxej/remainingcontamination_hg002_maternal_f1_assembly_v2_1_genbank.txt/?format=attachment
https://submit.ncbi.nlm.nih.gov/api/2.0/files/0jfct4oa/contamination_hg002_maternal_f1_assembly_v2_1_genbank.txt/?format=attachment
https://submit.ncbi.nlm.nih.gov/api/2.0/files/hdc3gyhd/fixedforeigncontaminations_hg002_maternal_f1_assembly_v2_1_genbank.txt/?format=attachment
https://submit.ncbi.nlm.nih.gov/api/2.0/files/hkf9hszi/foreigncontaminationmodified_hg002_paternal_f1_assembly_v2_1.zip/?format=attachment
https://submit.ncbi.nlm.nih.gov/api/2.0/files/itejzjlc/fixedforeigncontaminations_hg002_paternal_f1_assembly_v2_1.txt/?format=attachment
https://submit.ncbi.nlm.nih.gov/api/2.0/files/hvq6ktpn/remainingcontamination_hg002_paternal_f1_assembly_v2_1.txt/?format=attachment
https://subm

In [165]:
! wget https://submit.ncbi.nlm.nih.gov/api/2.0/files/8jtviypy/foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip/?format=attachment

--2021-06-17 17:47:23--  https://submit.ncbi.nlm.nih.gov/api/2.0/files/8jtviypy/foreigncontaminationmodified_hg002_maternal_f1_assembly_v2_1_genbank.zip/?format=attachment
Resolving submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)... 130.14.29.113, 2607:f220:41e:4290::113
Connecting to submit.ncbi.nlm.nih.gov (submit.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 885141678 (844M) [application/zip]
Saving to: ‘index.html?format=attachment’


2021-06-17 17:47:59 (23.7 MB/s) - ‘index.html?format=attachment’ saved [885141678/885141678]

