# Downsampling WGS Reads

This notebook walks through downsampling whole-genome sequencing (WGS) reads in order to test Practical Haplotype Graph (PHG) imputation. It uses seqtk to downsample the paired-end reads in a given directory and writes the downsampled reads (at 1x, 0.1x and 0.01x coverage) to a separate output folder for each coverage level.

In [1]:
import os
import random

In [2]:
# the working directory
working_dir = "/workdir/ahb232/phg_sorghum_apr2023/"

# folder containing the original WGS data
# (working_dir already ends in "/", so no leading slash is needed here)
original_wgs = working_dir + "WGS/sorghumbase/"

# folders for each set of downsampled data
coverage_1 = working_dir + "WGS/sorghumbase/coverage_1x/"
coverage_01 = working_dir + "WGS/sorghumbase/coverage_0.1x/"
coverage_001 = working_dir + "WGS/sorghumbase/coverage_0.01x/"
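
The sampling cell further down writes its output with a shell redirect (`>`), which does not create missing directories, so the three coverage folders need to exist before it runs. A minimal sketch of one way to create them follows. The helper name `ensure_dirs` is hypothetical and not part of the original notebook, and the demo uses a temporary directory so that it can run anywhere.

```python
import os
import tempfile


def ensure_dirs(*paths):
    """Create each directory (and any missing parents); do nothing if it already exists."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


# Demo under a throwaway base directory (an assumption for illustration only).
demo_base = tempfile.mkdtemp()
demo_dirs = [
    os.path.join(demo_base, "coverage_1x"),
    os.path.join(demo_base, "coverage_0.1x"),
    os.path.join(demo_base, "coverage_0.01x"),
]
ensure_dirs(*demo_dirs)
```

In this notebook the equivalent call would be `ensure_dirs(coverage_1, coverage_01, coverage_001)`.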

In [3]:
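# Approximate genome length and read length are needed to work out how many
# read pairs to sample for 1x, 0.1x and 0.01x coverage. Each pair contributes
# 2 * read_length bases, hence the division by 2; the resulting count is the
# number of reads to draw from each of the two files of a pair.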

genome_length = 716928177
read_length = 150

single_coverage = round(genome_length / read_length / 2)
tenth_coverage = round(single_coverage / 10)
hundreth_coverage = round(single_coverage / 100)
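
As a check on the arithmetic above, the same calculation can be written as a small function. This is a sketch: `pairs_for_coverage` is a hypothetical helper name, and it assumes every read is exactly `read_length` bases long, as the cell above does.

```python
def pairs_for_coverage(genome_length, read_length, coverage):
    """Number of read pairs whose combined bases give roughly `coverage` x the genome."""
    bases_per_pair = 2 * read_length  # one forward read plus one reverse read
    return round(coverage * genome_length / bases_per_pair)


# Same genome length and read length as the cell above.
targets = {c: pairs_for_coverage(716928177, 150, c) for c in (1, 0.1, 0.01)}
print(targets)  # {1: 2389761, 0.1: 238976, 0.01: 23898}
```

For these inputs the values match `single_coverage`, `tenth_coverage` and `hundreth_coverage` as computed in the notebook.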

In [4]:
# Loop through each file in the original WGS folder.
# NOTE: this script assumes paired-end reads whose files are named
#   <some identifier>_1.fq.gz
#   <some identifier>_2.fq.gz
# If a different naming scheme is used, this script must be modified.
# Both files of a pair are sampled with the same random seed so that
# read pairs stay matched. A new seed is drawn for each pair and each
# downsampling level, and printed so that the run can be reproduced.
# The output folders must already exist, because the shell redirect (>)
# does not create them.
for file in os.listdir(original_wgs):
    if file.endswith("_1.fq.gz"):
        
        # replace 1 with 2 to get second paired read file
        file2 = file[:-7] + "2" + file[-6:]
        
        file_path = original_wgs + file
        file2_path = original_wgs + file2
        
        print(file)
        
        # 1x coverage
        seed = random.randint(1, 1000)
        print("-- 1x coverage seed: " + str(seed))
        
        out_path = coverage_1 + "coverage_1x_" + file[:-3]
        out2_path = coverage_1 + "coverage_1x_" + file2[:-3]
        
        ! seqtk sample -s {seed} {file_path} {single_coverage} > {out_path}
        ! seqtk sample -s {seed} {file2_path} {single_coverage} > {out2_path}
        
        ! gzip {out_path}
        ! gzip {out2_path}
        
        # 0.1x coverage
        
        seed = random.randint(1, 1000)
        print("-- 0.1x coverage seed: " + str(seed))
        
        out_path = coverage_01 + "coverage_0.1x_" + file[:-3]
        out2_path = coverage_01 + "coverage_0.1x_" + file2[:-3]
        
        ! seqtk sample -s {seed} {file_path} {tenth_coverage} > {out_path}
        ! seqtk sample -s {seed} {file2_path} {tenth_coverage} > {out2_path}
    
        ! gzip {out_path}
        ! gzip {out2_path}
        
        # 0.01x coverage
        
        seed = random.randint(1, 1000)
        print("-- 0.01x coverage seed: " + str(seed))
        
        out_path = coverage_001 + "coverage_0.01x_" + file[:-3]
        out2_path = coverage_001 + "coverage_0.01x_" + file2[:-3]
        
        ! seqtk sample -s {seed} {file_path} {hundreth_coverage} > {out_path}
        ! seqtk sample -s {seed} {file2_path} {hundreth_coverage} > {out2_path}
        
        ! gzip {out_path}
        ! gzip {out2_path}
        

IS3614-3_270_1.fq.gz
-- 1x coverage seed: 230
-- 0.1x coverage seed: 84
-- 0.01x coverage seed: 123
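
The loop above relies on the same seed selecting the same read positions in both files of a pair. The sketch below illustrates why that works, using a plain reservoir sampler in Python. It is an illustration of the idea only, not seqtk's actual implementation, and the read names are made up. It also assumes both files list their reads in the same order and have the same number of records.

```python
import random


def reservoir_sample(records, k, seed):
    """Keep k records chosen uniformly at random in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    kept = []
    for i, record in enumerate(records):
        if i < k:
            kept.append(record)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # replace a kept record with probability k / (i + 1)
            if j < k:
                kept[j] = record
    return kept


# Two mock mate files with reads in the same order (hypothetical read names).
mate1 = [f"read{i}/1" for i in range(1000)]
mate2 = [f"read{i}/2" for i in range(1000)]

sample1 = reservoir_sample(mate1, 10, seed=230)
sample2 = reservoir_sample(mate2, 10, seed=230)

# The same seed yields the same sequence of random draws, so the same
# positions are kept in both files and the mates stay paired.
ids1 = [name.split("/")[0] for name in sample1]
ids2 = [name.split("/")[0] for name in sample2]
pairs_in_sync = ids1 == ids2
print(pairs_in_sync)  # True
```

The selection depends only on the seed and the position of each record, not on its content, which is why the two files must contain the same number of reads in the same order for the pairing to hold.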
