**@author: James V. Talwar**<br>

# Generating VADEr Feature Patches:

**About:** This notebook generates <font color='red'>**VADEr**</font> (SNP) feature patches as a function of a user-defined patch window length (W), also referred to as the patch size (`patch_window_length`). Internally, the notebook works with a patch radius (`patch_radius`) equal to the patch window length divided by 2.

In this notebook, `patch_window_length = 2.5e5` (250kb). If a different patch window length is desired, update this parameter below.

In [1]:
import pandas as pd
import os
from collections import defaultdict
import numpy as np
import joblib
import logging
logging.getLogger().setLevel(logging.INFO)

In [2]:
logger = logging.getLogger()
console = logging.StreamHandler()
logger.addHandler(console)

**USER: Update the following paths/parameters as required for your purposes:**
 - `patch_window_length`: Desired patch window size for VADEr features
 - `feature_directory`: Path to all candidate feature sets
    - **NB:** Here, the assumption is that all candidate feature sets have the following naming convention -> dataset.p-value.extract.txt (as one might use for formulating extracts with PLINK(2)). To run this notebook without any issues, follow this convention. If you prefer a different structure, you will need to adapt the `snpSets = ...` line below.
 - `write_path`: Path/directory to which you want to write all needed patch formulation files. It is expected that `write_path` will have two directories within it: `Patches_To_Features` and `Patch_To_Chrom_Mapping`, each of which will have a folder named `patch_window_length`kb (e.g. `250kb`) to which files will be written. To modify this behavior, adapt the file writes in the final cell of this notebook.
 - `chromosome_length_file`: Path to file with mapping of each chromosome to chromosome length. Hg19/GRCh37 chromosome sizes can be found at our [DARTH_VADEr repository](https://github.com/jvtalwar/DARTH_VADEr/tree/main) at `VADEr/Data/Reference_Genome_Build_Sizes/hg19.chrom.sizes`
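
As a quick illustration of the naming convention above, the sketch below shows how a p-value key can be pulled out of a file name by splitting on periods. The file names are hypothetical. Note that a threshold written with a decimal point (e.g. `0.05`) would not survive this split intact, so scientific notation such as `5e-08` is the safer choice under this convention.

```python
# Hypothetical file names following dataset.p-value.extract.txt
example_files = ["mydata.5e-08.extract.txt", "mydata.5e-04.extract.txt", "notes.md"]

# Keep only extract files and key them by the second period-delimited field
p_value_keys = [name.split(".")[1] for name in example_files if "extract" in name]
print(p_value_keys)  # ['5e-08', '5e-04']

# A decimal threshold is cut at its own period, so the key is no longer the p-value
broken_key = "mydata.0.05.extract.txt".split(".")[1]
print(broken_key)  # '0'
```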

In [3]:
patch_window_length = 2.5e5 
feature_directory = "../../Data/snps/extract/PC" 
write_path = "../../Data/Feature_Patches"
chromosome_length_file = "../../Data/Reference_Genome_Build_Sizes/hg19.chrom.sizes"
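
The final cell of this notebook expects the two sub-directories described above to exist already. A minimal sketch for creating them is shown here. It uses a temporary directory as a stand-in for `write_path` so that it can be run without touching real data, and the helper name `make_patch_dirs` is an assumption made for illustration.

```python
import os
import tempfile

def make_patch_dirs(write_path, patch_window_length):
    """Create <write_path>/<subdir>/<N>kb for both expected sub-directories."""
    kb_label = "{}kb".format(int(patch_window_length // 1e3))
    created = []
    for subdir in ["Patches_To_Features", "Patch_To_Chrom_Mapping"]:
        target = os.path.join(write_path, subdir, kb_label)
        os.makedirs(target, exist_ok=True)  # no error if it already exists
        created.append(target)
    return created

demo_root = tempfile.mkdtemp()  # stand-in for write_path
demo_dirs = make_patch_dirs(demo_root, 2.5e5)
print([os.path.relpath(d, demo_root) for d in demo_dirs])
```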

### Defining Patch Spacing 
 - Obtaining midpoints of each patch

In [4]:
patch_radius = patch_window_length/2 #generate a patch_radius from patch window (similar to Plink's --clump-kb parameter in clumping)

In [5]:
# Load in chromosome lengths/sizes
chromosome_lengths = pd.read_csv(chromosome_length_file, sep = "\t", header = None)
chromosome_lengths.columns = ["CHR", "LENGTH"]
iteratables = ["chr" + str(i) for i in range(1,23)] + ["chrX", "chrY"]
chromLengthMap = {k:v for k,v in dict(zip(chromosome_lengths.CHR, chromosome_lengths.LENGTH)).items() if k in iteratables}
chromLengthMap

{'chr1': 249250621,
 'chr2': 243199373,
 'chr3': 198022430,
 'chr4': 191154276,
 'chr5': 180915260,
 'chr6': 171115067,
 'chr7': 159138663,
 'chr8': 146364022,
 'chr9': 141213431,
 'chr10': 135534747,
 'chr11': 135006516,
 'chr12': 133851895,
 'chr13': 115169878,
 'chr14': 107349540,
 'chr15': 102531392,
 'chr16': 90354753,
 'chr17': 81195210,
 'chr18': 78077248,
 'chr19': 59128983,
 'chr20': 63025520,
 'chr21': 48129895,
 'chr22': 51304566,
 'chrX': 155270560,
 'chrY': 59373566}

In [6]:
chromValidPartitions = defaultdict(list)
for chrom, length in chromLengthMap.items():
    numChromRegions = length//(patch_radius * 2)
    i = 0
    midpoint = patch_radius
    while midpoint <= length:
        chromValidPartitions[chrom].append(midpoint)
        i += 1
        midpoint = i*2*patch_radius + patch_radius
    
    # The tail of a chromosome is left uncovered when the distance between the chr length and the final midpoint is > patch_radius.
    # In this case, add a final partition centered at the chr length so that the remaining tail is encapsulated.
    if (length - ((i-1)*2*patch_radius + patch_radius)) > patch_radius: 
        chromValidPartitions[chrom].append(length)
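
To make the partitioning rule concrete, here is a small, self-contained sketch that applies the same logic to toy chromosome lengths. The function name `patch_centers` and the toy lengths are assumptions chosen for illustration, and the radius of 5 stands in for `patch_radius`.

```python
def patch_centers(length, radius):
    """Return patch midpoints spaced 2*radius apart, plus a final center
    at `length` when the tail is not covered by the last midpoint."""
    centers = []
    midpoint = radius
    while midpoint <= length:
        centers.append(midpoint)
        midpoint += 2 * radius
    # Tail extends more than one radius past the last midpoint
    # (or no midpoint fit at all): add a patch centered on the end
    if not centers or length - centers[-1] > radius:
        centers.append(length)
    return centers

print(patch_centers(32, 5))  # [5, 15, 25, 32]
print(patch_centers(28, 5))  # [5, 15, 25]
```

In the first call the end of the toy chromosome sits 7 units past the last midpoint, which exceeds the radius, so an extra end-centered patch is added. In the second call the tail of 3 units is already covered.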

In [7]:
#Validate (i.e., ensure no issues with final patch assignment for each chromosome):
for c,l in chromLengthMap.items():
    right_distance = l - max(chromValidPartitions[c])
    if right_distance < 0 or right_distance > patch_radius:
        logging.warning("WARNING: Discrepancy issue at chrom {}: dist between end chrom and final window center point {}".format(c, right_distance))
        
    else:
        logging.info("{} right distance encapsulated by final partition: {}".format(c, right_distance))
        logging.info("{} left distance encapsulated by final partition: {}\n".format(c, max(chromValidPartitions[c]) - (chromValidPartitions[c][-2] + patch_radius)))

chr1 right distance encapsulated by final partition: 0
chr1 left distance encapsulated by final partition: 621.0

chr2 right distance encapsulated by final partition: 74373.0
chr2 left distance encapsulated by final partition: 125000.0

chr3 right distance encapsulated by final partition: 0
chr3 left distance encapsulated by final partition: 22430.0

chr4 right distance encapsulated by final partition: 29276.0
chr4 left distance encapsulated by final partition: 125000.0

chr5 right distance encapsulated by final partition: 40260.0
chr5 left distance encapsulated by final partition: 125000.0

chr6 right distance encapsulated by final partition: 0
chr6 left distance encapsulated by final partition: 115067.0

chr7 right distance encapsulated by final partition: 13663.0
chr7 left distance encapsulated by final partition: 125000.0

chr8 right distance encapsulated by final partition: 0
chr8 left distance encapsulated by final partition: 114022.0

chr9 right distance encapsulated by final pa... [output truncated]

Identify all candidate feature sets under investigation and extract/format needed information

In [8]:
snpSets = {file.split(".")[1]:pd.read_csv(os.path.join(feature_directory, file), header = None, dtype = str)[0].tolist() for file in os.listdir(feature_directory) if "extract" in file}

In [9]:
gottaCatchEmAllPlinkymon = defaultdict(list)
for k,v in snpSets.items():
    logging.info("Extracting and processing SNPs at p-value threshold of {}\n".format(k))
    for snp in v:
        assert "rs" not in snp
        chrom, basePair, ref, alt = snp.split(":")
        gottaCatchEmAllPlinkymon[k].append([chrom, snp, basePair])

Extracting and processing SNPs at p-value threshold of 5e-08

Extracting and processing SNPs at p-value threshold of 5e-07

Extracting and processing SNPs at p-value threshold of 5e-05

Extracting and processing SNPs at p-value threshold of 5e-06

Extracting and processing SNPs at p-value threshold of 5e-04
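
The loop above assumes that every variant ID has the form chromosome:position:ref:alt and contains no rsIDs. The sketch below shows that split on a single ID of this form, using a hypothetical helper `parse_snp_id` that raises a descriptive error when an ID has the wrong shape. Unlike the notebook cell, which keeps the position as a string, this helper casts it to `int`.

```python
def parse_snp_id(snp):
    """Split a CHR:BP:REF:ALT identifier into (chrom, base_pair, ref, alt)."""
    fields = snp.split(":")
    if len(fields) != 4:
        raise ValueError("Expected CHR:BP:REF:ALT, got {!r}".format(snp))
    chrom, base_pair, ref, alt = fields
    return chrom, int(base_pair), ref, alt

print(parse_snp_id("1:10008013:G:T"))  # ('1', 10008013, 'G', 'T')
```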



In [10]:
#convert dictionary to dataframe: 
pokedex = {k:pd.DataFrame(v, columns = ["CHR", "SNP", "BP"]).sort_values(by = ["CHR", "BP"]) for k,v in gottaCatchEmAllPlinkymon.items()}
pokedex["5e-04"].head()

| | CHR | SNP | BP |
|---|---|---|---|
| 52704 | 1 | 1:10008013:G:T | 10008013 |
| 52705 | 1 | 1:10008365:G:T | 10008365 |
| 52706 | 1 | 1:10008526:T:C | 10008526 |
| 52756 | 1 | 1:10012311:G:A | 10012311 |
| 52757 | 1 | 1:10014173:C:G | 10014173 |
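
Because the extract files are read with `dtype = str`, the `CHR` and `BP` columns hold strings at this point, so sorting on them is lexicographic rather than numeric. The plain-Python sketch below, using made-up positions, illustrates the difference. Casting to `int` before sorting is one way to get numeric order.

```python
positions = ["10008013", "999", "2500000"]  # made-up base-pair strings

lexicographic = sorted(positions)      # compares character by character
numeric = sorted(positions, key=int)   # compares as integers

print(lexicographic)  # ['10008013', '2500000', '999']
print(numeric)        # ['999', '2500000', '10008013']
```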


Validate/Sanity Check: Ensure no SNP BP exceeds the chromosome length

In [11]:
for k, df in pokedex.items():
    chromosomes = sorted(list(set(df.CHR)))
    for charmander in chromosomes:
        v = df[df.CHR == charmander] #subset to chromosome
        basePairs = list(v.BP.astype(int))
        if charmander == "23": #CHR values are strings here, so compare against "23" rather than the int 23
            charmander = "X"
        if max(basePairs) > chromLengthMap["chr" + str(charmander)]:
            logging.warning("WARNING {} CHR {}: Highest SNP BP exceeds the valid chromosome range...".format(k, charmander))
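
The check above needs to translate each chromosome label into a key of `chromLengthMap`, with 23 treated as the X chromosome. A minimal sketch of that translation is given below. The helper name `to_chrom_key` is hypothetical, and converting through `str` is a design choice that lets the same helper accept integer or string labels.

```python
def to_chrom_key(chrom):
    """Map a chromosome label such as '7', 7, '23' or 'X' to a key like 'chr7'."""
    label = str(chrom)
    if label == "23":  # this notebook treats 23 as the X chromosome
        label = "X"
    return "chr" + label

print([to_chrom_key(c) for c in ["1", 7, "23", 23, "X"]])
# ['chr1', 'chr7', 'chrX', 'chrX', 'chrX']
```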

Add column for each SNP set DF pertaining to its chromosomal patch (as defined by the `patch_radius`)

In [12]:
def AssignSNPPatchIndex(chromPartitions, clumpSummaryDF):
    '''
    Input(s): 
    1) chromPartitions - A dictionary of centered valid chromosomal partitions with a radius of patch_radius 
    2) clumpSummaryDF - A dataframe of sorted (by CHR and BP) SNPs for a particular feature set 

    Output(s):
    1) The input dataframe (modified in place) with an added ChromosomePatch column holding each SNP's chromosome-level patch index, as defined by patch_radius, and sorted by that column. These indices can be used for adding positional information to InSNPtion after clump compression.
    '''
    radius = chromPartitions["chr1"][0]
    logging.info("Radius of chromosomal partitions is {}".format(radius))
    thatMakesYouChartMan = list() #positional mapping
    for i, row in clumpSummaryDF.iterrows():
        charmander = "chr" + str(row["CHR"])
        if charmander == "chr23":
            charmander = "chrX"
            
        clumpChromPosition =  np.abs(np.asarray(chromPartitions[charmander]) - int(row["BP"])).argmin()
        thatMakesYouChartMan.append(clumpChromPosition)
        
    clumpSummaryDF["ChromosomePatch"] = thatMakesYouChartMan
    
    clumpSummaryDF.CHR = clumpSummaryDF.CHR.apply(lambda x: "23" if x == "X" else x)
    clumpSummaryDF.CHR = clumpSummaryDF.CHR.astype(int)
    clumpSummaryDF.BP = clumpSummaryDF.BP.astype(int)
    
    return clumpSummaryDF.sort_values(by = ["ChromosomePatch"])
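
The core of `AssignSNPPatchIndex` is a nearest-center lookup: each SNP receives the index of the patch midpoint closest to its base-pair position. A minimal sketch with toy centers (not real chromosome coordinates) is shown below. On an exact tie, `argmin` returns the first, i.e. lower, index.

```python
import numpy as np

centers = np.asarray([5, 15, 25, 32])  # toy patch midpoints
positions = [1, 9, 10, 24, 31]         # toy SNP base-pair positions

# Index of the closest center for each position
patch_index = [int(np.abs(centers - bp).argmin()) for bp in positions]
print(patch_index)  # [0, 0, 0, 2, 3]
```

Position 10 is equidistant from the centers at 5 and 15, and is assigned to patch 0.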

In [13]:
patch_mappings = dict()
for k,v in pokedex.items():
    logging.info("Processing SNP set {}".format(k))
    patch_mappings[k] = AssignSNPPatchIndex(chromPartitions = chromValidPartitions, clumpSummaryDF = v)

Processing SNP set 5e-08
Radius of chromosomal partitions is 125000.0
Processing SNP set 5e-07
Radius of chromosomal partitions is 125000.0
Processing SNP set 5e-05
Radius of chromosomal partitions is 125000.0
Processing SNP set 5e-06
Radius of chromosomal partitions is 125000.0
Processing SNP set 5e-04
Radius of chromosomal partitions is 125000.0


Generate and write VADEr patch mappings (for all candidate feature sets): **1) Patch feature dictionaries** and **2) Patch to chromosome locations** 

In [14]:
for k,v in patch_mappings.items():
    logging.info("Generating dictionaries for SNP set {}".format(k))
    pikaGroup = v.groupby(["CHR", "ChromosomePatch"])
    features = defaultdict(list)
    positionalInformation = defaultdict(lambda: defaultdict(int)) #convert to DF before saving - joblib hates lambda f(x)s
    clumpNum = 1
    chromInOrderCheck = list()
    patchInOrderCheck = list()
    prevKey = -1
    for keysGroupedBy, dataGroupedBy in pikaGroup:
        features["clump" + str(clumpNum)] = list(dataGroupedBy.sort_values(by = ["BP"]).SNP)
        positionalInformation["clump" + str(clumpNum)]["CHR"] = keysGroupedBy[0] 
        positionalInformation["clump" + str(clumpNum)]["ChromosomePatch"] = keysGroupedBy[1] 
        
        chromInOrderCheck.append(keysGroupedBy[0])
        
        if prevKey != keysGroupedBy[0]:
            prevKey = keysGroupedBy[0]
            if (len(patchInOrderCheck) == 0):
                patchInOrderCheck.append(keysGroupedBy[1])
            else:
                if (patchInOrderCheck != sorted(patchInOrderCheck)):
                    logging.warning("Chromosomal patches may be out of order")
                patchInOrderCheck = [keysGroupedBy[1]]
        else:
            patchInOrderCheck.append(keysGroupedBy[1])
        
        clumpNum += 1
    
    #type check
    chromsAsArray = np.array(chromInOrderCheck)
    isAnInt = np.issubdtype(chromsAsArray.dtype, np.integer)
    isAFloat = np.issubdtype(chromsAsArray.dtype, np.floating)
    
    assert isAnInt or isAFloat, f"Chromosomes not stored as int/floats and violate numerical ordering. Current dtype {np.array(chromInOrderCheck).dtype}"
    if (chromInOrderCheck != sorted(chromInOrderCheck)): 
        logging.warning("Patches may be out of order")
        
    #convert positional info into DF
    positionalInformation = pd.DataFrame(positionalInformation).T
    assert (positionalInformation.CHR == sorted(positionalInformation.CHR)).all(), "Chromosomes to patch ordering is askew..."
    
    #Save relevant structures:
    joblib.dump(features, os.path.join(write_path, "Patches_To_Features/{}kb".format(str(int(2*patch_radius//1e3))), k + "_{}kb_DistanceClumps.joblib".format(int(2*patch_radius/1000))))
    positionalInformation.to_csv(os.path.join(write_path, "Patch_To_Chrom_Mapping/{}kb".format(str(int(2*patch_radius//1e3))), k + "_{}kb_PositionalMaps.tsv".format(int(2*patch_radius/1000))), sep = "\t")


Generating dictionaries for SNP set 5e-08
Generating dictionaries for SNP set 5e-07
Generating dictionaries for SNP set 5e-05
Generating dictionaries for SNP set 5e-06
Generating dictionaries for SNP set 5e-04
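
A condensed sketch of the grouping step, using a toy dataframe with made-up SNP IDs, shows the shape of the two structures that are written out: a mapping from patch name to its SNPs, and a table of each patch's chromosome and patch index. It omits the ordering checks and file writes of the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "CHR": [1, 1, 1, 2],
    "SNP": ["1:30:A:G", "1:4:C:T", "1:9:G:A", "2:7:T:C"],
    "BP": [30, 4, 9, 7],
    "ChromosomePatch": [2, 0, 0, 0],
})

features, positions = {}, {}
# groupby sorts group keys by default, so patches arrive in (CHR, patch) order
grouped = toy.groupby(["CHR", "ChromosomePatch"])
for n, ((chrom, patch), group) in enumerate(grouped, start=1):
    name = "clump" + str(n)
    features[name] = list(group.sort_values(by="BP").SNP)  # SNPs ordered by position
    positions[name] = {"CHR": chrom, "ChromosomePatch": patch}

positional_df = pd.DataFrame(positions).T  # one row per patch
print(features)
print(positional_df)
```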
