In [1]:
import numpy as np
import pandas as pd
import sys
import os
import re
BASE_DIR="/private/groups/hprc/qc_hmm_flagger/hprc_intermediate_assembly/assembly_qc"

### This notebook:

#### Create HiFi and ONT table for batch3 data (06 March 2025)

All ONT/HiFi sequencing data should now be fully wrangled, and censat annotations should exist for all assemblies. The related tables are:
* Assembly: https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/assemblies_pre_release_v0.6.1.index.csv
* censat: https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/annotation/censat/censat_pre_release_v0.3.index.csv
* ONT: https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/sequencing_data/data_ont_pre_release.index.csv
* HiFi: https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/sequencing_data/data_hifi_pre_release.index.csv
* HiFi(DeepConsensus): https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/sequencing_data/data_deepconsensus_pre_release.index.csv

Steps for this notebook:
* Download pre-release tables
* Get the list of samples included in the previous batches (batch1, batch1_jan_12_2025, batch2), separately for HiFi and ONT. The csv files created for `rerun_march_01_2025` are used to list those samples. They were created to rerun HMM-Flagger v1.2 on all the mappings produced for those batches:
    * HiFi:
        * `rerun_march_01_2025/hmm_flagger/hifi/hmm_flagger_hifi_data_table.csv`
    * ONT:
        * `rerun_march_01_2025/hmm_flagger/ont/hmm_flagger_ont_data_table.csv`
* Find samples that have a censat annotation but were missed in the previous batches (because either the read data or the censat annotation was missing at the time)
* Get ONT and HiFi reads for those samples
* Download the censat annotations and make diploid bed files
* Make separate data tables for HiFi and ONT runs (both will contain diploid censat bed files)
* Save the final data tables in the `hifi/` and `ont/` subdirectories; they will be used for creating input json files
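
The sample-selection steps listed above can be sketched on toy data. This is only an illustration: the sample IDs and tables below are made up, and only the column names `sample_id` (censat and rerun tables) and `sample_ID` (read tables) follow the pre-release tables.

```python
import pandas as pd

# Toy stand-ins for the real tables (all sample IDs here are hypothetical).
censat = pd.DataFrame({"sample_id": ["S1", "S1", "S2", "S2", "S3", "S3"],
                       "haplotype": ["hap1", "hap2"] * 3})
already_run = pd.DataFrame({"sample_id": ["S1"]})
reads = pd.DataFrame({"sample_ID": ["S1", "S2", "S2"],
                      "coverage": [30.0, 20.0, 25.0]})

# Samples with a censat annotation that were not part of an earlier batch.
candidates = set(censat["sample_id"]) - set(already_run["sample_id"])

# Of those, separate the samples that have read files from those that do not.
with_reads = sorted(candidates & set(reads["sample_ID"]))
without_reads = sorted(candidates - set(reads["sample_ID"]))

print(with_reads)     # ['S2']
print(without_reads)  # ['S3']
```

Sorting the result keeps the printed order stable between runs, which a plain `list(set(...))` does not guarantee.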

### Download and parse pre-release tables

In [2]:
#!wget https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/annotation/censat/censat_pre_release_v0.3.index.csv
#!wget https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/sequencing_data/data_ont_pre_release.index.csv
#!wget https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/sequencing_data/data_hifi_pre_release.index.csv
#!wget https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/assemblies_pre_release_v0.6.1.index.csv
#!wget https://raw.githubusercontent.com/human-pangenomics/hprc_intermediate_assembly/refs/heads/main/data_tables/sequencing_data/data_deepconsensus_pre_release.index.csv

In [3]:
!ls

assemblies_pre_release_v0.6.1.index.csv
censat_pre_release_v0.3.index.csv
data_deepconsensus_pre_release.index.csv
data_hifi_pre_release.index.csv
data_ont_pre_release.index.csv
diploid_censat_beds
hifi
make_hmm_flagger_data_tables_batch3.ipynb
ont


In [4]:
assembly_pre_release = pd.read_csv("assemblies_pre_release_v0.6.1.index.csv")
censat_pre_release = pd.read_csv("censat_pre_release_v0.3.index.csv")
data_ont_pre_release = pd.read_csv("data_ont_pre_release.index.csv")
data_hifi_pre_release = pd.read_csv("data_hifi_pre_release.index.csv")
data_hifi_dc_pre_release = pd.read_csv("data_deepconsensus_pre_release.index.csv")

In [5]:
rerun_hifi_table_march_01 = pd.read_csv(f"{BASE_DIR}/rerun_march_01_2025/hmm_flagger/hifi/hmm_flagger_hifi_data_table.csv")
rerun_ont_table_march_01 = pd.read_csv(f"{BASE_DIR}/rerun_march_01_2025/hmm_flagger/ont/hmm_flagger_ont_data_table.csv")

In [6]:
rerun_hifi_table_march_01_samples = rerun_hifi_table_march_01['sample_id']
print("Number of samples already ran with HiFi data : ", len(rerun_hifi_table_march_01_samples))

Number of samples already ran with HiFi data :  212


In [7]:
rerun_ont_table_march_01_samples = rerun_ont_table_march_01['sample_id']
print("Number of samples already ran with ONT data : ", len(rerun_ont_table_march_01_samples))

Number of samples already ran with ONT data :  195


### Find samples that were missed before but now have new data

In [8]:
new_samples_hifi = list(set(censat_pre_release['sample_id']).difference(rerun_hifi_table_march_01_samples))
print(f"These are  {len(new_samples_hifi)}  samples with new HiFi data that should be run for batch3 : ")
print("\n".join(new_samples_hifi))

These are  19  samples with new HiFi data that should be run for batch3 : 
NA18959
HG02027
NA20762
NA18982
NA18940
NA18960
HG005
HG01786
HG06807
NA20806
NA20503
NA18967
NA18944
NA18970
NA18948
HG00733
NA20827
NA18945
NA18943


In [9]:
new_samples_ont = list(set(censat_pre_release['sample_id']).difference(rerun_ont_table_march_01_samples))
print(f"These are  {len(new_samples_ont)}  samples with new ONT data that should be run for batch3 : ")
print("\n".join(new_samples_ont))

These are  36  samples with new ONT data that should be run for batch3 : 
NA20762
HG02109
HG01786
HG03486
HG02145
HG06807
NA20806
HG02055
NA18944
NA18906
NA18970
HG00733
NA19240
NA18945
HG02080
NA18943
NA18948
NA20752
HG02818
NA18959
HG02027
HG03098
NA20799
NA18982
NA18940
NA18960
HG005
NA19159
NA20129
HG01109
NA21309
NA20503
NA18967
NA20827
HG02723
HG01243


In [10]:
def addDownsampledColumn(merged_reads_table, coverage_threshold, total_cores=64, min_cores=4):
    # Adds the downsampled read-file columns to merged_reads_table in place.
    # total_cores defaults to 64, the value of totalCores used throughout this notebook.
    downsampled_paths_per_sample = []
    downsampled_coverages = []

    for coverages, paths in zip(merged_reads_table["coverage"], merged_reads_table["read_files"]):
        # sort the (coverage, path) pairs from highest to lowest coverage
        coverage_path_tuples = sorted(zip(coverages, paths), key=lambda x: x[0], reverse=True)
        summed_coverage = 0
        downsampled_paths = []
        # keep adding files until the cumulative coverage reaches the threshold
        for coverage, path in coverage_path_tuples:
            summed_coverage += coverage
            downsampled_paths.append(path)
            if summed_coverage >= coverage_threshold:
                break
        downsampled_paths_per_sample.append(downsampled_paths)
        downsampled_coverages.append(round(summed_coverage, 2))

    index = merged_reads_table.index
    merged_reads_table["read_files_downsampled"] = pd.Series(downsampled_paths_per_sample, index=index)
    merged_reads_table["total_coverage_downsampled"] = pd.Series(downsampled_coverages, index=index)

    merged_reads_table["number_of_read_files_downsampled"] = merged_reads_table["read_files_downsampled"].apply(len)
    # split the cores evenly across read files, with at least min_cores per task
    merged_reads_table["number_of_cores_per_task_downsampled"] = (total_cores // merged_reads_table["number_of_read_files_downsampled"]).clip(lower=min_cores)
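
The selection rule inside `addDownsampledColumn` can be tried on a single sample. The sketch below is a standalone, simplified version; `select_files_by_coverage` and the file names are hypothetical and not part of the notebook.

```python
def select_files_by_coverage(coverages, paths, coverage_threshold):
    """Pick files from highest to lowest coverage until the threshold is reached."""
    ranked = sorted(zip(coverages, paths), key=lambda cp: cp[0], reverse=True)
    selected, total = [], 0.0
    for coverage, path in ranked:
        selected.append(path)
        total += coverage
        if total >= coverage_threshold:
            break
    return selected, round(total, 2)


selected, total = select_files_by_coverage(
    coverages=[15.0, 40.0, 35.0, 10.0],
    paths=["a.bam", "b.bam", "c.bam", "d.bam"],
    coverage_threshold=60,
)
print(selected, total)  # ['b.bam', 'c.bam'] 75.0
```

Because whole files are added, the selected coverage can overshoot the threshold by up to the coverage of the last file taken. This is why the downsampled maxima reported in this notebook are above 60x.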

### Make a merged read table for HiFi and downsample if necessary

In [11]:
data_hifi_pre_release.head()

Unnamed: 0,sample_ID,filetype,filename,path,data_type,production,coverage,deepconsensus_coverage,deepconsensus_filename,deepconsensus_path,...,quartile_25,quartile_50,quartile_75,ntsm_score,MM_tag,primrose_filename,MM_review,MM_remove,lima_version,lima_float_version
0,HG00099,bam,m54329U_220825_174247-bc2012.5mc.hifi_reads.bam,s3://human-pangenomics/working/HPRC/HG00099/ra...,unaligned reads,UW_HPRC_HiFi_Y3,15.4,18.1,HG00099.m54329U_220825_174247.dc.q20.fastq.gz,s3://human-pangenomics/submissions/42AFCE59-29...,...,17755,20124,23333,,True,,True,False,2.5.1,2.0501
1,HG00099,bam,m54329U_220827_143814-bc2050.5mc.hifi_reads.bam,s3://human-pangenomics/working/HPRC/HG00099/ra...,unaligned reads,UW_HPRC_HiFi_Y3,14.7,17.07,HG00099.m54329U_220827_143814.dc.q20.fastq.gz,s3://human-pangenomics/submissions/42AFCE59-29...,...,17274,19325,22189,,True,,True,False,2.5.1,2.0501
2,HG00280,bam,m54329U_220901_221341-bc2051.5mc.hifi_reads.bam,s3://human-pangenomics/working/HPRC/HG00280/ra...,unaligned reads,UW_HPRC_HiFi_Y3,15.2,17.78,HG00280.m54329U_220901_221341.dc.q20.fastq.gz,s3://human-pangenomics/submissions/42AFCE59-29...,...,16982,19131,22147,,True,,True,False,2.5.1,2.0501
3,HG00558,bam,m54329U_220107_233847-bc1016.5mc.hifi_reads.bam,s3://human-pangenomics/working/HPRC/HG00558/ra...,unaligned reads,UW_HPRC_HiFi_Y3,10.9,12.77,HG00558.m54329U_220107_233847.dc.q20.fastq.gz,s3://human-pangenomics/submissions/42AFCE59-29...,...,16632,20068,24477,,True,,True,False,2.5.0,2.05
4,HG00639,bam,m54329U_211222_104516-bc1010.5mc.hifi_reads.bam,s3://human-pangenomics/working/HPRC/HG00639/ra...,unaligned reads,UW_HPRC_HiFi_Y3,9.7,12.43,HG00639.m54329U_211222_104516.dc.q20.fastq.gz,s3://human-pangenomics/submissions/42AFCE59-29...,...,15750,18909,22999,,True,,True,False,2.5.0,2.05


In [12]:
# merge hifi full table
totalCores = 64
merged_hifi_full_table = data_hifi_pre_release.groupby("sample_ID", as_index=False).agg(lambda x: list(x))
merged_hifi_full_table.rename(columns={"sample_ID": "sample_id"}, inplace=True)
merged_hifi_full_table.rename(columns={"path": "read_files"}, inplace=True)
merged_hifi_full_table["total_coverage"] = merged_hifi_full_table["coverage"].apply(sum)
merged_hifi_full_table["number_of_read_files"] = merged_hifi_full_table["read_files"].apply(len)
merged_hifi_full_table["number_of_cores_per_task"] = (totalCores / merged_hifi_full_table["number_of_read_files"]).astype(int)
merged_hifi_full_table['number_of_cores_per_task'] = merged_hifi_full_table['number_of_cores_per_task'].apply(lambda x: max(4,x))
merged_hifi_full_table["mapper_preset"] = "lr:hqae"
merged_hifi_full_table["kmer_size"] = 25

merged_hifi_full_table["hmm_flagger_preset"] = 'hifi'
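
As a rough illustration of the per-sample merge, the sketch below collapses a toy read table into one row per sample and derives the core allocation. The sample IDs and file names are invented; `TOTAL_CORES` and `MIN_CORES_PER_TASK` mirror the values 64 and 4 used in the notebook, and floor division with `clip` is used here as an equivalent of the `astype(int)` plus `max(4, x)` steps.

```python
import pandas as pd

TOTAL_CORES = 64         # same value as totalCores in the notebook
MIN_CORES_PER_TASK = 4   # same lower bound as max(4, x) in the notebook

# Toy read table: S1 has 2 files, S2 has 1 file, S3 has 20 small files.
reads = pd.DataFrame({
    "sample_ID": ["S1", "S1", "S2"] + ["S3"] * 20,
    "path": [f"file_{i}.bam" for i in range(23)],
    "coverage": [15.0, 20.0, 30.0] + [2.0] * 20,
})

# One row per sample; every other column becomes a list of per-file values.
merged = reads.groupby("sample_ID", as_index=False).agg(list)
merged = merged.rename(columns={"sample_ID": "sample_id", "path": "read_files"})
merged["total_coverage"] = merged["coverage"].apply(sum)
merged["number_of_read_files"] = merged["read_files"].apply(len)
merged["number_of_cores_per_task"] = (
    TOTAL_CORES // merged["number_of_read_files"]
).clip(lower=MIN_CORES_PER_TASK)

print(merged[["sample_id", "total_coverage",
              "number_of_read_files", "number_of_cores_per_task"]])
```

For S3 the even split would give only 3 cores per file, so the lower bound of 4 applies and the tasks together request more than 64 cores.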

In [13]:
# check the maximum total coverage (some samples are far above 60x)
max(merged_hifi_full_table['total_coverage'])

189.79999999999998

In [14]:
# Read files are sorted by coverage and selected starting from the file with the highest coverage.
# No more read files are added once the cumulative coverage reaches 60x, so the final total can exceed 60x.
addDownsampledColumn(merged_hifi_full_table, 60)
max(merged_hifi_full_table['total_coverage_downsampled'])

81.6

### Make a merged read table for HiFi (DeepConsensus) and downsample if necessary

In [15]:
data_hifi_dc_pre_release.head()

Unnamed: 0,accession,study,bioproject_accession,biosample_accession,library_ID,title,library_strategy,library_source,library_selection,library_layout,...,N75,sample_ID,path,production,data_type,notes,MM_tag,coverage,ntsm_score,ccs_algorithm
0,SRR25618944,SRP305758,PRJNA701308,SAMN26237490,HG00423_lib1_dc,PacBio HiFi sequencing of HG00423 rebasecalled...,WGS,GENOMIC,size fractionation,single,...,20614,HG00423,s3://human-pangenomics/working/HPRC/HG00423/ra...,HPRC_DEEPCONSENSUS_v1pt2,unaligned reads,,,12.91,,
1,SRR25618943,SRP305758,PRJNA701308,SAMN26237491,HG00544_lib1_dc,PacBio HiFi sequencing of HG00544 rebasecalled...,WGS,GENOMIC,size fractionation,single,...,22870,HG00544,s3://human-pangenomics/working/HPRC/HG00544/ra...,HPRC_DEEPCONSENSUS_v1pt2,unaligned reads,,,10.26,,
2,SRR25618932,SRP305758,PRJNA701308,SAMN26237491,HG00544_lib2_dc,PacBio HiFi sequencing of HG00544 rebasecalled...,WGS,GENOMIC,size fractionation,single,...,20289,HG00544,s3://human-pangenomics/working/HPRC/HG00544/ra...,HPRC_DEEPCONSENSUS_v1pt2,unaligned reads,,,13.74,,
3,SRR25618921,SRP305758,PRJNA701308,SAMN26237492,HG00609_lib1_dc,PacBio HiFi sequencing of HG00609 rebasecalled...,WGS,GENOMIC,size fractionation,single,...,23774,HG00609,s3://human-pangenomics/working/HPRC/HG00609/ra...,HPRC_DEEPCONSENSUS_v1pt2,unaligned reads,,,12.6,,
4,SRR25618910,SRP305758,PRJNA701308,SAMN26267378,HG00642.HFSS_dc,PacBio HiFi sequencing of HG00642 rebasecalled...,WGS,GENOMIC,size fractionation,single,...,23857,HG00642,s3://human-pangenomics/working/HPRC/HG00642/ra...,HPRC_DEEPCONSENSUS_v1pt2,unaligned reads,,,13.52,,


In [16]:
# merge hifi (DeepConsensus) full table
totalCores = 64
merged_hifi_dc_full_table = data_hifi_dc_pre_release.groupby("sample_ID", as_index=False).agg(lambda x: list(x))
merged_hifi_dc_full_table.rename(columns={"sample_ID": "sample_id"}, inplace=True)
merged_hifi_dc_full_table.rename(columns={"path": "read_files"}, inplace=True)
merged_hifi_dc_full_table["total_coverage"] = merged_hifi_dc_full_table["coverage"].apply(sum)
merged_hifi_dc_full_table["number_of_read_files"] = merged_hifi_dc_full_table["read_files"].apply(len)
merged_hifi_dc_full_table["number_of_cores_per_task"] = (totalCores / merged_hifi_dc_full_table["number_of_read_files"]).astype(int)
merged_hifi_dc_full_table['number_of_cores_per_task'] = merged_hifi_dc_full_table['number_of_cores_per_task'].apply(lambda x: max(4,x))
merged_hifi_dc_full_table["mapper_preset"] = "lr:hqae"
merged_hifi_dc_full_table["kmer_size"] = 25

merged_hifi_dc_full_table["hmm_flagger_preset"] = 'hifi'

In [17]:
# check the maximum total coverage
max(merged_hifi_dc_full_table['total_coverage'])

84.47

In [18]:
# Read files are sorted by coverage and selected starting from the file with the highest coverage.
# No more read files are added once the cumulative coverage reaches 60x, so the final total can exceed 60x.
addDownsampledColumn(merged_hifi_dc_full_table, 60)
max(merged_hifi_dc_full_table['total_coverage_downsampled'])

70.21

### Make a merged read table for ONT and downsample if necessary

In [19]:
data_ont_pre_release.head()

Unnamed: 0,filename,filetype,sample_ID,biosample_accession,library_ID,library_strategy,library_source,library_selection,library_layout,platform,...,300kb+,400kb+,500kb+,1Mb+,whales,accession,study,bioproject_accession,production,sequencing_chemistry
0,03_14_23_R941_HG00621_1_Guppy_6.5.7_450bps_mod...,bam,HG00621,SAMN17861653,03_14_23_R941_HG00621_1_Guppy_6.5.7_450bps_mod...,WGS,GENOMIC,RANDOM,single,OXFORD_NANOPORE,...,0.35,0.07,0.01,0.0,1,SRR31367103,SRP305758,PRJNA701308,UCSC_HPRC_ONT_Y1_WTOPUP_GUPPY6,R941
1,03_14_23_R941_HG00621_2_Guppy_6.5.7_450bps_mod...,bam,HG00621,SAMN17861653,03_14_23_R941_HG00621_2_Guppy_6.5.7_450bps_mod...,WGS,GENOMIC,RANDOM,single,OXFORD_NANOPORE,...,0.32,0.06,0.01,0.0,0,SRR31367102,SRP305758,PRJNA701308,UCSC_HPRC_ONT_Y1_WTOPUP_GUPPY6,R941
2,03_14_23_R941_HG01952_1_Guppy_6.5.7_450bps_mod...,bam,HG01952,SAMN17861661,03_14_23_R941_HG01952_1_Guppy_6.5.7_450bps_mod...,WGS,GENOMIC,RANDOM,single,OXFORD_NANOPORE,...,0.2,0.02,0.01,0.0,0,SRR31366972,SRP305758,PRJNA701308,UCSC_HPRC_ONT_Y1_WTOPUP_GUPPY6,R941
3,03_14_23_R941_HG01952_2_Guppy_6.5.7_450bps_mod...,bam,HG01952,SAMN17861661,03_14_23_R941_HG01952_2_Guppy_6.5.7_450bps_mod...,WGS,GENOMIC,RANDOM,single,OXFORD_NANOPORE,...,0.11,0.01,0.0,0.0,0,SRR31367144,SRP305758,PRJNA701308,UCSC_HPRC_ONT_Y1_WTOPUP_GUPPY6,R941
4,03_14_23_R941_HG02148_1_Guppy_6.5.7_450bps_mod...,bam,HG02148,SAMN17861663,03_14_23_R941_HG02148_1_Guppy_6.5.7_450bps_mod...,WGS,GENOMIC,RANDOM,single,OXFORD_NANOPORE,...,0.23,0.04,0.01,0.0,0,SRR31367119,SRP305758,PRJNA701308,UCSC_HPRC_ONT_Y1_WTOPUP_GUPPY6,R941


In [20]:
totalCores = 64
# merge ont table
merged_ont_full_table = data_ont_pre_release.groupby("sample_ID", as_index=False).agg(lambda x: list(x))
merged_ont_full_table.rename(columns={"sample_ID": "sample_id"} ,inplace=True)
merged_ont_full_table.rename(columns={"path": "read_files"}, inplace=True)
merged_ont_full_table["total_coverage"] = merged_ont_full_table["coverage"].apply(sum).apply(lambda x: round(x,2))
merged_ont_full_table["sequencing_chemistry"] = merged_ont_full_table["sequencing_chemistry"].apply(lambda x: x[0] if len(set(x)) == 1 else ",".join(set(x)))
merged_ont_full_table["number_of_read_files"] = merged_ont_full_table["read_files"].apply(len)
merged_ont_full_table["number_of_cores_per_task"] = (totalCores / merged_ont_full_table["number_of_read_files"]).astype(int)
merged_ont_full_table['number_of_cores_per_task'] = merged_ont_full_table['number_of_cores_per_task'].apply(lambda x: max(4,x))

# preset for R1041 is lr:hqae
# preset for R941 is map-ont
# samples with mixed chemistries keep the empty/zero defaults set below
is_r1041 = merged_ont_full_table["sequencing_chemistry"] == "R1041"
is_r941 = merged_ont_full_table["sequencing_chemistry"] == "R941"

# use .loc so that each assignment modifies the table itself rather than a temporary copy
merged_ont_full_table["mapper_preset"] = ""
merged_ont_full_table.loc[is_r1041, "mapper_preset"] = "lr:hqae"
merged_ont_full_table.loc[is_r941, "mapper_preset"] = "map-ont"

merged_ont_full_table["kmer_size"] = 0
merged_ont_full_table.loc[is_r1041, "kmer_size"] = 25
merged_ont_full_table.loc[is_r941, "kmer_size"] = 15

merged_ont_full_table["hmm_flagger_preset"] = ''
merged_ont_full_table.loc[is_r1041, "hmm_flagger_preset"] = 'ont-r10'
merged_ont_full_table.loc[is_r941, "hmm_flagger_preset"] = 'ont-r9'

In [21]:
# check the maximum total coverage (some samples are far above 60x)
max(merged_ont_full_table['total_coverage'])

228.17

In [22]:
# Read files are sorted by coverage and selected starting from the file with the highest coverage.
# No more read files are added once the cumulative coverage reaches 60x, so the final total can exceed 60x.
addDownsampledColumn(merged_ont_full_table, 60)
max(merged_ont_full_table['total_coverage_downsampled'])

87.17
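
For the ONT table built above, the chemistry-dependent settings could also be written as a small lookup table joined onto the per-sample table. This is a sketch on toy data, not the notebook's own method; the preset values are copied from the ONT cell, while `ONT_SETTINGS` and the sample IDs are hypothetical. One difference to note: samples whose chemistry is not in the lookup get missing values here instead of the empty string and 0 defaults.

```python
import pandas as pd

# Settings per chemistry, copied from the ONT cell in this notebook.
ONT_SETTINGS = pd.DataFrame(
    {
        "mapper_preset": ["lr:hqae", "map-ont"],
        "kmer_size": [25, 15],
        "hmm_flagger_preset": ["ont-r10", "ont-r9"],
    },
    index=pd.Index(["R1041", "R941"], name="sequencing_chemistry"),
)

# Toy per-sample table; S3 mixes both chemistries.
ont = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "sequencing_chemistry": ["R1041", "R941", "R1041,R941"],
})

# Left join on the chemistry label; unknown labels get NaN in the setting columns.
ont = ont.join(ONT_SETTINGS, on="sequencing_chemistry")

# Samples that need a manual decision before running the mapper.
unresolved = ont.loc[ont["mapper_preset"].isna(), "sample_id"].tolist()
print(unresolved)  # ['S3']
```

Listing the unresolved samples makes mixed-chemistry cases visible before the data table is written.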

## Make assembly table

In [23]:
assembly_pre_release.head()

Unnamed: 0,sample_id,haplotype,phasing,assembly_method,assembly_method_version,assembly_date,assembly_name,source,genbank_accession,assembly_md5,assembly_fai,assembly_gzi,assembly
0,HG00408,1,trio,hifiasm,0.19.7,2024-08,HG00408_pat_hprc_r2_v1.0.1,hprc,GCA_041900255.1,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
1,HG00597,1,trio,hifiasm,0.19.7,2024-08,HG00597_pat_hprc_r2_v1.0.1,hprc,GCA_041900365.1,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
2,HG01192,1,trio,hifiasm,0.19.7,2024-08,HG01192_pat_hprc_r2_v1.0.1,hprc,GCA_041900145.1,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
3,HG01261,1,trio,hifiasm,0.19.7,2024-08,HG01261_pat_hprc_r2_v1.0.1,hprc,GCA_041900235.1,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
4,HG02015,1,trio,hifiasm,0.19.7,2024-08,HG02015_pat_hprc_r2_v1.0.1,hprc,GCA_041900165.1,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...


In [24]:
# make two assembly tables one for hap1 and one for hap2
assembly_pre_release_hap1 = assembly_pre_release[assembly_pre_release['haplotype'] == 1]
assembly_pre_release_hap2 = assembly_pre_release[assembly_pre_release['haplotype'] == 2]

# Merging the DataFrames on 'sample_id'
assembly_pre_release_diploid = pd.merge(assembly_pre_release_hap1,
                                        assembly_pre_release_hap2,
                                        on='sample_id',
                                        suffixes=('_hap1', '_hap2'))

# keep only necessary columns
assembly_pre_release_diploid = assembly_pre_release_diploid[["sample_id", 
                                                             "assembly_hap1",
                                                             "assembly_hap2"]]
assembly_pre_release_diploid.head()

Unnamed: 0,sample_id,assembly_hap1,assembly_hap2
0,HG00408,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
1,HG00597,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
2,HG01192,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
3,HG01261,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
4,HG02015,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...
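
The inner merge above silently drops samples that have only one haplotype and would duplicate rows if a haplotype appeared twice for a sample. A possible check, sketched on made-up data, uses the `validate` and `indicator` options of `pd.merge`; the sample IDs and file names are hypothetical.

```python
import pandas as pd

# Toy assembly table: S2 has only haplotype 1.
assemblies = pd.DataFrame({
    "sample_id": ["S1", "S1", "S2", "S3", "S3"],
    "haplotype": [1, 2, 1, 1, 2],
    "assembly": ["s1_h1.fa.gz", "s1_h2.fa.gz", "s2_h1.fa.gz",
                 "s3_h1.fa.gz", "s3_h2.fa.gz"],
})

hap1 = assemblies[assemblies["haplotype"] == 1]
hap2 = assemblies[assemblies["haplotype"] == 2]

# validate raises MergeError if a sample has more than one row per haplotype;
# indicator adds a "_merge" column telling which side each row came from.
diploid = pd.merge(hap1, hap2, on="sample_id", how="outer",
                   suffixes=("_hap1", "_hap2"),
                   validate="one_to_one", indicator=True)

incomplete = diploid.loc[diploid["_merge"] != "both", "sample_id"].tolist()
diploid = diploid.loc[diploid["_merge"] == "both",
                      ["sample_id", "assembly_hap1", "assembly_hap2"]]

print(incomplete)  # ['S2']
```

After filtering on `_merge == "both"`, the result matches what the inner merge returns, with the incomplete samples reported instead of dropped silently.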


## Make a table for diploid censat bed files

In [26]:
censat_pre_release.head()

Unnamed: 0,sample_id,haplotype,assembly_name,location,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,HG00408,hap1,HG00408_pat_hprc_r2_v1,s3://human-pangenomics/submissions/DC27718F-5F...,,,,,,,,,,,,HG00408_pat_hprc_r2_v1.cenSat.bed
1,HG00408,hap2,HG00408_mat_hprc_r2_v1,s3://human-pangenomics/submissions/DC27718F-5F...,,,,,,,,,,,,HG00408_mat_hprc_r2_v1.cenSat.bed
2,HG00597,hap1,HG00597_pat_hprc_r2_v1,s3://human-pangenomics/submissions/DC27718F-5F...,,,,,,,,,,,,HG00597_pat_hprc_r2_v1.cenSat.bed
3,HG00597,hap2,HG00597_mat_hprc_r2_v1,s3://human-pangenomics/submissions/DC27718F-5F...,,,,,,,,,,,,HG00597_mat_hprc_r2_v1.cenSat.bed
4,HG01192,hap1,HG01192_pat_hprc_r2_v1,s3://human-pangenomics/submissions/DC27718F-5F...,,,,,,,,,,,,HG01192_pat_hprc_r2_v1.cenSat.bed


In [27]:
# rename the column holding the censat link
censat_pre_release = censat_pre_release.rename(columns={"location": "cenSatAnnotations"})

# replace s3 link with https
censat_pre_release['cenSatAnnotations'] = censat_pre_release['cenSatAnnotations'].str.replace('s3://','https://s3-us-west-2.amazonaws.com/')

# make two tables one for hap1 and one for hap2
censat_pre_release_hap1 = censat_pre_release[censat_pre_release['haplotype'] == 'hap1']
censat_pre_release_hap2 = censat_pre_release[censat_pre_release['haplotype'] == 'hap2']

# Merging the DataFrames on 'sample_id'
censat_pre_release_merged = pd.merge(censat_pre_release_hap1,
                                     censat_pre_release_hap2,
                                     on='sample_id',
                                     suffixes=('_hap1', '_hap2'))
censat_pre_release_merged = censat_pre_release_merged[["sample_id",
                                                       "cenSatAnnotations_hap1",
                                                       "cenSatAnnotations_hap2"]]
censat_pre_release_merged.head()

Unnamed: 0,sample_id,cenSatAnnotations_hap1,cenSatAnnotations_hap2
0,HG00408,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...
1,HG00597,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...
2,HG01192,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...
3,HG01261,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...
4,HG02015,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...


In [28]:
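# take the union of the new HiFi and ONT samples
new_samples_union = list(set(new_samples_hifi).union(set(new_samples_ont)))
# .copy() makes an independent table, so adding columns later does not raise SettingWithCopyWarning
censat_table_batch3 = censat_pre_release_merged[censat_pre_release_merged['sample_id'].isin(new_samples_union)].copy()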
print(f"We have {len(new_samples_union)} new samples in total")
print(f"Number of censat rows: {len(censat_table_batch3)}")

We have 36 new samples in total
Number of censat rows: 36


In [29]:
# reset the index to 0..n-1 so that rows can be accessed by position below
censat_table_batch3.index = np.arange(len(censat_table_batch3))

In [30]:
def addDiploidCenSatAnnotation(censat_table_diploid, diploid_censat_dir, create_files):
    # add a column for saving diploid censat bed files
    censat_table_diploid["censat_diploid_bed"] = ""

    # make a directory for saving diploid censat bed files
    !mkdir -p {diploid_censat_dir}

    censat_diploid_list = []
    additional_annotations_array_list = []
    # iterate over rows
    for i in range(len(censat_table_diploid)):
        sample = censat_table_diploid["sample_id"][i]
        censat_bed_hap1 = censat_table_diploid["cenSatAnnotations_hap1"][i]
        censat_bed_hap2 = censat_table_diploid["cenSatAnnotations_hap2"][i]
        censat_bed_hap1_name = os.path.basename(censat_bed_hap1)
        censat_bed_hap2_name = os.path.basename(censat_bed_hap2)

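        if create_files:
            # download censat files
            # NOTE: when a file with the same name already exists, wget saves the new copy
            # as NAME.1 and the commands below still read the older NAME file
            !cd {diploid_censat_dir} && wget {censat_bed_hap1}
            !cd {diploid_censat_dir} && wget {censat_bed_hap2}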
        
            # concat hap1 and hap2 censat bed files into a single bed file
            !cat {diploid_censat_dir}/{censat_bed_hap1_name} {diploid_censat_dir}/{censat_bed_hap2_name} | bedtools sort -i - > {diploid_censat_dir}/{sample}_dip_hprc_r2_v1.cenSat.bed
            !cat {diploid_censat_dir}/{censat_bed_hap1_name} {diploid_censat_dir}/{censat_bed_hap2_name} | bedtools sort -i - | grep -i "rDNA" | awk '{{print $$1"\t"$$2"\t"$$3}}' > {diploid_censat_dir}/{sample}_dip_hprc_r2_v1.cenSat.rDNA.bed

        # add new bed to the table
        censat_diploid_list.append(f'{diploid_censat_dir}/{sample}_dip_hprc_r2_v1.cenSat.bed')
        # just adding rDNA annotation as an additional annotation
        additional_annotations_array_list.append([f'{diploid_censat_dir}/{sample}_dip_hprc_r2_v1.cenSat.rDNA.bed'])

    censat_table_diploid["censat_diploid_bed"] = censat_diploid_list
    censat_table_diploid["additional_annotations_array"] = additional_annotations_array_list
    #censat_table_diploid.head()
    return censat_table_diploid
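
For readers without `bedtools`, the file-creation step can be approximated in plain Python. This is a sketch under assumptions, not a replacement for the shell pipeline: `make_diploid_beds` is a hypothetical helper, the sort mimics the default chromosome-then-start ordering of `bedtools sort`, header lines (`#`, `track`, `browser`) are skipped, and the annotation labels in the toy files are invented.

```python
import os
import tempfile


def make_diploid_beds(hap1_bed, hap2_bed, diploid_bed, rdna_bed):
    """Concatenate two BED files, sort them, and write the rDNA intervals separately."""
    records = []
    for path in (hap1_bed, hap2_bed):
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3 and not line.startswith(("#", "track", "browser")):
                    records.append(fields)
    # Sort by contig name, then by numeric start position.
    records.sort(key=lambda f: (f[0], int(f[1])))
    with open(diploid_bed, "w") as out:
        for fields in records:
            out.write("\t".join(fields) + "\n")
    # Case-insensitive match on "rDNA", keeping only the first three columns.
    with open(rdna_bed, "w") as out:
        for fields in records:
            if "rdna" in "\t".join(fields).lower():
                out.write("\t".join(fields[:3]) + "\n")
    return len(records)


with tempfile.TemporaryDirectory() as tmp:
    hap1 = os.path.join(tmp, "hap1.bed")
    hap2 = os.path.join(tmp, "hap2.bed")
    with open(hap1, "w") as f:
        f.write("contigB\t500\t900\trDNA(array_1)\ncontigA\t100\t200\tct_1\n")
    with open(hap2, "w") as f:
        f.write("contigC\t0\t50\tactive_hor\n")
    dip = os.path.join(tmp, "dip.bed")
    rdna = os.path.join(tmp, "dip.rDNA.bed")
    n_records = make_diploid_beds(hap1, hap2, dip, rdna)
    with open(dip) as f:
        dip_contigs = [line.split("\t")[0] for line in f]
    with open(rdna) as f:
        rdna_lines = f.read().splitlines()

print(n_records, dip_contigs, rdna_lines)
```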

In [31]:
diploid_censat_dir_batch3 = f'{BASE_DIR}/batch3/hmm_flagger/diploid_censat_beds'
censat_table_diploid_batch3 = addDiploidCenSatAnnotation(censat_table_diploid = censat_table_batch3,
                                                         diploid_censat_dir = diploid_censat_dir_batch3,
                                                         create_files = True)

--2025-03-10 11:29:23--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/HG01109/assemblies/freeze_2/annotation/censat/HG01109_pat_hprc_r2_v1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.129.136, 52.92.194.104, 52.92.227.176, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.129.136|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 253950 (248K) [binary/octet-stream]
Saving to: ‘HG01109_pat_hprc_r2_v1.cenSat.bed.1’


2025-03-10 11:29:23 (3.93 MB/s) - ‘HG01109_pat_hprc_r2_v1.cenSat.bed.1’ saved [253950/253950]

--2025-03-10 11:29:23--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/HG01109/assemblies/freeze_2/annotation/censat/HG01109_mat_hprc_r2_v1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.227.176, 52.92.129.136, 52.218.252.120, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.227.176|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252606 (247K) [binary/octet-stream]
Saving to: ‘HG01109_mat_hprc_r2_v1.cenSat.bed.1’


2025-03-10 11:29:23 (2.30 MB/s) - ‘HG01109_mat_hprc_r2_v1.cenSat.bed.1’ saved [252606/252606]

--2025-03-10 11:29:24--  https://s3-us-west-2.amazonaws.com/human-

HTTP request sent, awaiting response... 200 OK
Length: 256064 (250K) [binary/octet-stream]
Saving to: ‘HG02818_mat_hprc_r2_v1.cenSat.bed.1’


2025-03-10 11:29:31 (2.97 MB/s) - ‘HG02818_mat_hprc_r2_v1.cenSat.bed.1’ saved [256064/256064]

--2025-03-10 11:29:31--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/HG03098/assemblies/freeze_2/annotation/censat/HG03098_pat_hprc_r2_v1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.186.24, 52.92.237.104, 52.92.213.96, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.186.24|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 272014 (266K) [binary/octet-stream]
Saving to: ‘HG03098_pat_hprc_r2_v1.cenSat.bed.1’


2025-03-10 11:29:32 (3.20 MB/s) - ‘HG03098_pat_hprc_r2_v1.cenSat.bed.1’ saved [272014/272014]

--2025-03-10 11:29:32--  https://s3-us-west-2.amazonaws.com/human-pangenomics/sub

HTTP request sent, awaiting response... 200 OK
Length: 251235 (245K) [binary/octet-stream]
Saving to: ‘NA20752_hap1_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:29:39 (3.85 MB/s) - ‘NA20752_hap1_hprc_r2_v1.0.1.cenSat.bed.1’ saved [251235/251235]

--2025-03-10 11:29:39--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/NA20752/assemblies/freeze_2/annotation/censat/NA20752_hap2_hprc_r2_v1.0.1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.190.232, 52.218.133.48, 52.92.211.64, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.190.232|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 256739 (251K) [binary/octet-stream]
Saving to: ‘NA20752_hap2_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:29:40 (2.97 MB/s) - ‘NA20752_hap2_hprc_r2_v1.0.1.cenSat.bed.1’ saved [256739/256739]

--2025-03-10 11:29:40--  https://s3-us-west-2.amazonaw

HTTP request sent, awaiting response... 200 OK
Length: 248686 (243K) [binary/octet-stream]
Saving to: ‘HG00733_mat_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:29:47 (2.85 MB/s) - ‘HG00733_mat_hprc_r2_v1.0.1.cenSat.bed.1’ saved [248686/248686]

--2025-03-10 11:29:48--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/HG01243/assemblies/freeze_2/annotation/censat/HG01243_pat_hprc_r2_v1.0.1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.144.48, 52.92.227.32, 52.218.237.16, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.144.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 256984 (251K) [binary/octet-stream]
Saving to: ‘HG01243_pat_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:29:48 (2.33 MB/s) - ‘HG01243_pat_hprc_r2_v1.0.1.cenSat.bed.1’ saved [256984/256984]

--2025-03-10 11:29:48--  https://s3-us-west-2.amazonaws.com/h

HTTP request sent, awaiting response... 200 OK
Length: 270728 (264K) [binary/octet-stream]
Saving to: ‘NA18943_hap1_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:29:55 (4.05 MB/s) - ‘NA18943_hap1_hprc_r2_v1.0.1.cenSat.bed.1’ saved [270728/270728]

--2025-03-10 11:29:56--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/NA18943/assemblies/freeze_2/annotation/censat/NA18943_hap2_hprc_r2_v1.0.1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.179.144, 52.92.186.248, 52.92.147.0, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.179.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 247443 (242K) [binary/octet-stream]
Saving to: ‘NA18943_hap2_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:29:56 (2.86 MB/s) - ‘NA18943_hap2_hprc_r2_v1.0.1.cenSat.bed.1’ saved [247443/247443]

--2025-03-10 11:29:57--  https://s3-us-west-2.amazonaws

HTTP request sent, awaiting response... 200 OK
Length: 244450 (239K) [binary/octet-stream]
Saving to: ‘NA18960_hap2_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:30:04 (2.84 MB/s) - ‘NA18960_hap2_hprc_r2_v1.0.1.cenSat.bed.1’ saved [244450/244450]

--2025-03-10 11:30:04--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/NA18967/assemblies/freeze_2/annotation/censat/NA18967_hap1_hprc_r2_v1.0.1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.228.16, 52.92.224.128, 52.92.128.120, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.228.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 282518 (276K) [binary/octet-stream]
Saving to: ‘NA18967_hap1_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:30:04 (3.71 MB/s) - ‘NA18967_hap1_hprc_r2_v1.0.1.cenSat.bed.1’ saved [282518/282518]

--2025-03-10 11:30:05--  https://s3-us-west-2.amazonaws

HTTP request sent, awaiting response... 200 OK
Length: 267828 (262K) [binary/octet-stream]
Saving to: ‘NA20806_hap1_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:30:12 (4.07 MB/s) - ‘NA20806_hap1_hprc_r2_v1.0.1.cenSat.bed.1’ saved [267828/267828]

--2025-03-10 11:30:12--  https://s3-us-west-2.amazonaws.com/human-pangenomics/submissions/DC27718F-5F38-43B0-9A78-270F395F13E8--INT_ASM_PRODUCTION/NA20806/assemblies/freeze_2/annotation/censat/NA20806_hap2_hprc_r2_v1.0.1.cenSat.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.234.232, 52.92.132.80, 52.92.179.88, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.234.232|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 247181 (241K) [binary/octet-stream]
Saving to: ‘NA20806_hap2_hprc_r2_v1.0.1.cenSat.bed.1’


2025-03-10 11:30:12 (2.83 MB/s) - ‘NA20806_hap2_hprc_r2_v1.0.1.cenSat.bed.1’ saved [247181/247181]

--2025-03-10 11:30:13--  https://s3-us-west-2.amazonaws

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [32]:
censat_table_diploid_batch3.head()

Unnamed: 0,sample_id,cenSatAnnotations_hap1,cenSatAnnotations_hap2,censat_diploid_bed,additional_annotations_array
0,HG01109,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
1,HG02055,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
2,HG02080,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
3,HG02109,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
4,HG02723,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...


## Merge censat and assembly tables

In [33]:
assembly_and_censat_table_batch3 = pd.merge(assembly_pre_release_diploid, 
                                            censat_table_diploid_batch3, 
                                            on='sample_id',  
                                            how='inner')

In [34]:
assembly_and_censat_table_batch3.head()

Unnamed: 0,sample_id,assembly_hap1,assembly_hap2,cenSatAnnotations_hap1,cenSatAnnotations_hap2,censat_diploid_bed,additional_annotations_array
0,NA20752,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
1,HG02145,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
2,HG02027,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
3,NA19159,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
4,HG01786,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...
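
An inner merge silently drops samples that appear in only one table, and it multiplies rows if `sample_id` is duplicated on either side. The following is a minimal sketch of how both cases could be checked, using small made-up tables (`assembly_demo`, `censat_demo` and their values are hypothetical stand-ins, not the real pre-release tables):

```python
import pandas as pd

# Toy stand-ins for the assembly and censat tables (hypothetical values).
assembly_demo = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "assembly_hap1": ["S1_hap1.fa", "S2_hap1.fa", "S3_hap1.fa"],
})
censat_demo = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "censat_diploid_bed": ["S1.bed", "S2.bed"],
})

# validate="one_to_one" makes pandas raise a MergeError if sample_id
# is duplicated in either input, instead of silently multiplying rows.
merged_demo = pd.merge(assembly_demo, censat_demo,
                       on="sample_id", how="inner",
                       validate="one_to_one")

# An outer merge with indicator=True adds a "_merge" column that tells
# which side each row came from, so dropped samples become visible.
audit = pd.merge(assembly_demo, censat_demo,
                 on="sample_id", how="outer", indicator=True)
dropped = sorted(audit.loc[audit["_merge"] != "both", "sample_id"])
print("Samples dropped by the inner merge:", dropped)
```

The same two keyword arguments could be passed to the real `pd.merge` calls in this notebook without changing their results when the keys are unique.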


## Make a HiFi data table

In [35]:
print("Assembly+Censat Table {} out of {} exist for HiFi".format(len(set(assembly_and_censat_table_batch3['sample_id']).intersection(new_samples_hifi)),
                                                                  len(new_samples_hifi)))

Assembly+Censat Table 19 out of 19 exist for HiFi


In [42]:
print("These samples are missing for HiFi")
missing_samples = set(new_samples_hifi).difference(set(merged_hifi_full_table['sample_id']))
missing_samples

These samples are missing for HiFi


{'HG005', 'NA20503', 'NA20762', 'NA20806', 'NA20827'}

In [37]:
hifi_data_table_batch3 = pd.merge(assembly_and_censat_table_batch3, 
                                  merged_hifi_full_table, 
                                  on='sample_id',  
                                  how='inner')
hifi_data_table_batch3 = hifi_data_table_batch3[hifi_data_table_batch3['sample_id'].isin(new_samples_hifi)]

In [38]:
hifi_data_table_batch3 = hifi_data_table_batch3.reset_index(drop=True)

In [39]:
print("Number of rows in the final data table for HiFi : ", len(hifi_data_table_batch3))

Number of rows in the final data table for HiFi :  14


In [40]:
columns_to_keep = ['sample_id',
                   'assembly_hap1', 
                   'assembly_hap2',
                   'cenSatAnnotations_hap1',
                   'cenSatAnnotations_hap2',
                   'censat_diploid_bed',
                   'additional_annotations_array', 'read_files',
                   'coverage','platform', 'total_coverage', 'number_of_read_files',
                   'number_of_cores_per_task', 'mapper_preset', 'kmer_size',
                   'hmm_flagger_preset', 'read_files_downsampled',
                   'total_coverage_downsampled', 'number_of_read_files_downsampled',
                   'number_of_cores_per_task_downsampled']

In [41]:
hifi_data_table_batch3 = hifi_data_table_batch3[columns_to_keep]

## Make a HiFi (DeepConsensus) data table

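The 5 samples missing from the HiFi table (`missing_samples`) should exist in the DeepConsensus table. Together with the 14 rows found above, they account for all 19 new HiFi samples.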

In [48]:
hifi_dc_data_table_batch3 = pd.merge(assembly_and_censat_table_batch3, 
                                     merged_hifi_dc_full_table, 
                                     on='sample_id',  
                                     how='inner')
hifi_dc_data_table_batch3 = hifi_dc_data_table_batch3[hifi_dc_data_table_batch3['sample_id'].isin(missing_samples)]

In [52]:
hifi_dc_data_table_batch3 = hifi_dc_data_table_batch3[columns_to_keep]
hifi_dc_data_table_batch3.index = np.arange(len(hifi_dc_data_table_batch3))

In [53]:
hifi_dc_data_table_batch3

Unnamed: 0,sample_id,assembly_hap1,assembly_hap2,cenSatAnnotations_hap1,cenSatAnnotations_hap2,censat_diploid_bed,additional_annotations_array,read_files,coverage,platform,total_coverage,number_of_read_files,number_of_cores_per_task,mapper_preset,kmer_size,hmm_flagger_preset,read_files_downsampled,total_coverage_downsampled,number_of_read_files_downsampled,number_of_cores_per_task_downsampled
0,HG005,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/FFC78D9F-2...,"[8.4, 7.7, 6.6, 5.0, 10.2, 9.6, 9.4]","[nan, nan, nan, nan, nan, nan, nan]",56.9,7,9,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/FFC78D9F-2...,56.9,7,9
1,NA20503,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[41.7],[nan],41.7,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,41.7,1,64
2,NA20762,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[45.7],[nan],45.7,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,45.7,1,64
3,NA20806,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[47.0],[nan],47.0,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,47.0,1,64
4,NA20827,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[48.9],[nan],48.9,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,48.9,1,64
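
The filter with `isin(missing_samples)` returns fewer rows without any error if one of the missing samples is also absent from the DeepConsensus table. A small sketch of a stricter lookup is shown here; `require_samples` and the demo table are hypothetical names introduced only for illustration:

```python
import pandas as pd

def require_samples(table, sample_ids, table_name):
    """Return the rows of `table` for `sample_ids`; raise if any id has no row."""
    absent = sorted(set(sample_ids) - set(table["sample_id"]))
    if absent:
        raise ValueError(f"{table_name} has no rows for: {absent}")
    selected = table[table["sample_id"].isin(sample_ids)]
    return selected.reset_index(drop=True)

# Toy stand-in for a fallback read table (hypothetical sample ids).
dc_demo = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "total_coverage": [56.9, 41.7, 45.7],
})

fallback_rows = require_samples(dc_demo, {"S1", "S3"}, "DeepConsensus demo table")
print(fallback_rows)
```

Called with the real `missing_samples` set, a helper like this would fail loudly instead of producing a table with fewer than 5 rows.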


In [56]:
# Put the DeepConsensus-only samples first, followed by the regular HiFi samples
hifi_data_table_batch3 = pd.concat([hifi_dc_data_table_batch3, hifi_data_table_batch3])
hifi_data_table_batch3.index = np.arange(len(hifi_data_table_batch3))
hifi_data_table_batch3

Unnamed: 0,sample_id,assembly_hap1,assembly_hap2,cenSatAnnotations_hap1,cenSatAnnotations_hap2,censat_diploid_bed,additional_annotations_array,read_files,coverage,platform,total_coverage,number_of_read_files,number_of_cores_per_task,mapper_preset,kmer_size,hmm_flagger_preset,read_files_downsampled,total_coverage_downsampled,number_of_read_files_downsampled,number_of_cores_per_task_downsampled
0,HG005,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/FFC78D9F-2...,"[8.4, 7.7, 6.6, 5.0, 10.2, 9.6, 9.4]","[nan, nan, nan, nan, nan, nan, nan]",56.9,7,9,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/FFC78D9F-2...,56.9,7,9
1,NA20503,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[41.7],[nan],41.7,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,41.7,1,64
2,NA20762,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[45.7],[nan],45.7,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,45.7,1,64
3,NA20806,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[47.0],[nan],47.0,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,47.0,1,64
4,NA20827,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/82BE5FDF-1...,[48.9],[nan],48.9,1,64,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/82BE5FDF-1...,48.9,1,64
5,HG02027,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/working/HPRC/HG02027/r...,"[9.2, 8.9, 11.1, 10.3, 10.8, 10.6]","[PACBIO_SMRT, PACBIO_SMRT, PACBIO_SMRT, PACBIO...",60.9,6,10,lr:hqae,25,hifi,[s3://human-pangenomics/working/HPRC/HG02027/r...,60.9,6,10
6,HG01786,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/working/HPRC/HG01786/r...,"[36.5, 37.1]","[PACBIO_SMRT, PACBIO_SMRT]",73.6,2,32,lr:hqae,25,hifi,[s3://human-pangenomics/working/HPRC/HG01786/r...,73.6,2,32
7,HG00733,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/working/HPRC_PLUS/HG00...,"[13.4, 12.7, 13.9, 12.9, 13.2]","[nan, nan, nan, nan, nan]",66.1,5,12,lr:hqae,25,hifi,[s3://human-pangenomics/working/HPRC_PLUS/HG00...,66.1,5,12
8,NA18940,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/B25289BC-5...,"[9.2, 11.1, 9.5, 10.2]","[nan, nan, nan, nan]",40.0,4,16,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/B25289BC-5...,40.0,4,16
9,NA18943,s3://human-pangenomics/submissions/DC27718F-5F...,s3://human-pangenomics/submissions/DC27718F-5F...,https://s3-us-west-2.amazonaws.com/human-pange...,https://s3-us-west-2.amazonaws.com/human-pange...,/private/groups/hprc/qc_hmm_flagger/hprc_inter...,[/private/groups/hprc/qc_hmm_flagger/hprc_inte...,[s3://human-pangenomics/submissions/B25289BC-5...,"[9.5, 11.0, 10.1, 10.1]","[nan, nan, nan, nan]",40.7,4,16,lr:hqae,25,hifi,[s3://human-pangenomics/submissions/B25289BC-5...,40.7,4,16
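
After concatenating two tables it can be worth confirming that no sample appears twice and that every expected sample is present. The sketch below does this on made-up data (`dc_demo`, `hifi_demo` and `expected_samples` are hypothetical); it also uses `ignore_index=True`, which renumbers the rows from 0 during the concatenation:

```python
import pandas as pd

# Toy stand-ins for the DeepConsensus-only and regular HiFi tables.
dc_demo = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "total_coverage": [56.9, 41.7],
})
hifi_demo = pd.DataFrame({
    "sample_id": ["S3", "S4", "S5"],
    "total_coverage": [60.9, 73.6, 66.1],
})
expected_samples = {"S1", "S2", "S3", "S4", "S5"}

# ignore_index=True discards the input indexes and assigns 0..n-1.
combined = pd.concat([dc_demo, hifi_demo], ignore_index=True)

# A sample listed in both inputs would show up here.
duplicated_ids = combined.loc[combined["sample_id"].duplicated(), "sample_id"].tolist()
if duplicated_ids:
    raise ValueError(f"samples present in both tables: {duplicated_ids}")

# Every expected sample should have exactly one row.
absent = sorted(expected_samples - set(combined["sample_id"]))
if absent:
    raise ValueError(f"samples without any row: {absent}")

print(len(combined), "rows, one per expected sample")
```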


## Make an ONT data table

In [57]:
print("Assembly+Censat Table {} out of {} exist for ONT".format(len(set(assembly_and_censat_table_batch3['sample_id']).intersection(new_samples_ont)),
                                                                len(new_samples_ont)))

Assembly+Censat Table 36 out of 36 exist for ONT


In [58]:
print("These samples are missing for ONT")
set(new_samples_ont).difference(set(merged_ont_full_table['sample_id']))

These samples are missing for ONT


set()

In [59]:
ont_data_table_batch3 = pd.merge(assembly_and_censat_table_batch3, 
                                 merged_ont_full_table, 
                                 on='sample_id',  
                                 how='inner')
ont_data_table_batch3 = ont_data_table_batch3[ont_data_table_batch3['sample_id'].isin(new_samples_ont)]

In [60]:
ont_data_table_batch3 = ont_data_table_batch3.reset_index(drop=True)

In [61]:
print("Number of rows in the final data table for ONT : ", len(ont_data_table_batch3))

Number of rows in the final data table for ONT :  36


In [62]:
# ONT-specific column list: same as the HiFi list except that
# 'sequencing_chemistry' takes the place of 'platform'.
ont_columns_to_keep = ['sample_id',
                       'assembly_hap1',
                       'assembly_hap2',
                       'cenSatAnnotations_hap1',
                       'cenSatAnnotations_hap2',
                       'censat_diploid_bed',
                       'additional_annotations_array',
                       'read_files', 'coverage',
                       'sequencing_chemistry', 'total_coverage', 'number_of_read_files',
                       'number_of_cores_per_task', 'mapper_preset', 'kmer_size',
                       'hmm_flagger_preset', 'read_files_downsampled',
                       'total_coverage_downsampled', 'number_of_read_files_downsampled',
                       'number_of_cores_per_task_downsampled']

In [63]:
ont_data_table_batch3 = ont_data_table_batch3[ont_columns_to_keep]
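
The HiFi and ONT column lists in this notebook differ in a single entry (`platform` versus `sequencing_chemistry`), so keeping two hand-written copies invites mix-ups. One possible arrangement, sketched here with a hypothetical helper `columns_to_keep_for`, builds both lists from shared pieces:

```python
import pandas as pd

# Columns shared by both technologies, split around the one entry that differs.
LEADING_COLUMNS = [
    "sample_id", "assembly_hap1", "assembly_hap2",
    "cenSatAnnotations_hap1", "cenSatAnnotations_hap2",
    "censat_diploid_bed", "additional_annotations_array",
    "read_files", "coverage",
]
TRAILING_COLUMNS = [
    "total_coverage", "number_of_read_files", "number_of_cores_per_task",
    "mapper_preset", "kmer_size", "hmm_flagger_preset",
    "read_files_downsampled", "total_coverage_downsampled",
    "number_of_read_files_downsampled", "number_of_cores_per_task_downsampled",
]

def columns_to_keep_for(technology_column):
    """Build the full column list around the technology-specific column."""
    return LEADING_COLUMNS + [technology_column] + TRAILING_COLUMNS

hifi_columns = columns_to_keep_for("platform")
ont_columns = columns_to_keep_for("sequencing_chemistry")

# Toy one-row table with an extra column, to show the selection itself.
demo = pd.DataFrame([{name: 0 for name in ont_columns + ["extra_column"]}])
selected = demo[ont_columns]
print(list(selected.columns))
```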

## Save the final data tables

In [64]:
os.makedirs("ont", exist_ok=True)
ont_data_table_batch3.to_csv('ont/hmm_flagger_ont_data_table.csv', index=False)

os.makedirs("hifi", exist_ok=True)
hifi_data_table_batch3.to_csv('hifi/hmm_flagger_hifi_data_table.csv', index=False)
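
Several columns in these tables (for example `read_files` and `coverage`) hold Python lists, and `to_csv` writes each list as its string representation. Anyone reading the CSV back gets strings unless the columns are parsed again. The sketch below shows one way to do that with `ast.literal_eval` on made-up data (the bucket paths and values are hypothetical). Note that `literal_eval` cannot parse lists containing `nan`, such as the `platform` values shown above, so those columns would need different handling.

```python
import ast
import io

import pandas as pd

# Toy table with list-valued columns (hypothetical paths and coverages).
demo = pd.DataFrame({
    "sample_id": ["S1", "S2"],
    "read_files": [["s3://bucket/a.bam", "s3://bucket/b.bam"],
                   ["s3://bucket/c.bam"]],
    "coverage": [[20.5, 21.0], [41.7]],
})

# Write to an in-memory buffer instead of a file for this demonstration.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Converters receive each cell as a string; literal_eval turns
# "['x', 'y']" or "[1.0, 2.0]" back into a Python list.
list_columns = ["read_files", "coverage"]
restored = pd.read_csv(buffer,
                       converters={name: ast.literal_eval for name in list_columns})

print(type(restored.loc[0, "read_files"]))
```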