### File Explanation

**loadData_HBTRC.ipynb:**
<br> This notebook loads the raw HBTRC data (genotypes in '.bed, .bim, and .fam' formats) and converts it into NumPy arrays and lists for more efficient computation.

**Processes are as follows:**
<br>1) Load data from ".bed, .bim, and .fam" formats by using **"pandas_plink"** package
<br>2) Genes that are associated with SNPs (by using sSNP_Gene variable) are found and stored as **sGeneID_0_nG**.
<br>3) All transcription factors/targets (including Entrez ID and gene ID) in the dataset are identified by "gwasPreprocessing_v2_tfs.Rdata".
<br>4) Pairs of transcription factors and targets are stored in nPairs_lPlC (function: **tf_Pairs_Finding** in **loadData_utils.py**)
<br>5) Missing values in RNA expression data are replaced with the average value of the corresponding gene.
<br>6) Missing values in SNPs are replaced with 3, since encoding 3 represents the most common genotype, "A/A".
<br>7) Genotypes, gene expression, probe location, SNP location .csv files are created for MatrixEQTL + LD analysis
<br> 8) The resulting variables are saved to "loadData_HBTRC.pickle", which is loaded by "preprocessData_HBTRC.ipynb"

**Some abbreviations used:**
<br> **S:** number of subjects 
<br> **R:** number of transcripts
<br> **N:** number of SNPs
<br> **P:** number of pairs
<br> **C:** number of categories 

**Variables created for preprocessData:**
<br> 1) **rRna_nSR:** RNA (gene) expression levels, a NumPy array, shape of (S x R)
<br> 2) **rSnp_nSN:** SNP encodings, a NumPy array, shape of (S x N)
<br> 3) **nNs_lRlN:** Names of the SNPs located on each gene, a list of length (R) with one array of SNP names per transcript
<br> 4) **nPairs_lPlC:** Known gene pairs, a list, length of (2204) with columns for "TF ID, TF Entrez ID, TF Gene Symbol, Target Entrez ID, Target Gene Symbol"
<br> 5) **nSs:** Subject names
<br> 6) **nRs:** Transcript (gene) names
<br> 7) **nNs:** SNP names
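
The shapes listed above imply a few invariants that the downstream notebook can rely on. A minimal sketch of checking them, using small synthetic arrays in place of the real data (the helper name `check_shapes` is hypothetical and not part of the notebook):

```python
import numpy as np

def check_shapes(rRna_nSR, rSnp_nSN, nNs_lRlN, nSs, nRs, nNs):
    """Return (S, R, N) after verifying that the containers agree."""
    S, R = rRna_nSR.shape
    N = rSnp_nSN.shape[1]
    assert rSnp_nSN.shape[0] == S   # same subjects in both matrices
    assert len(nSs) == S            # one name per subject
    assert len(nRs) == R            # one name per transcript
    assert len(nNs) == N            # one name per SNP
    assert len(nNs_lRlN) == R       # one SNP-name array per transcript
    return S, R, N

# Synthetic stand-ins: 4 subjects, 2 transcripts, 3 SNPs
rRna = np.zeros((4, 2))
rSnp = np.ones((4, 3))
snps_per_gene = [np.array(["rs1", "rs2"]), np.array(["rs3"])]
dims = check_shapes(rRna, rSnp, snps_per_gene,
                    nSs=["s1", "s2", "s3", "s4"],
                    nRs=["10", "20"],
                    nNs=["rs1", "rs2", "rs3"])
print(dims)
```

A check of this kind could be run right after unpickling, before any further preprocessing.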

**Files created for MatrixEQTL + LD analysis:**
<br> 1) **eqtl_expression.csv**: Gene expressions of all transcripts, csv format (rows: R, columns: S)
<br> 2) **eqtl_genotype.csv**: SNP encodings (1, 2, 3) of SNPs, csv format (rows: N, columns: S)
<br> 3) **eqtl_probe_loc.csv**: Gene locations, csv format (rows: R, columns: gene ID, chromosome number, start location, end location)
<br> 4) **eqtl_snp_loc.csv**: SNP locations, csv format (rows: N, columns: SNP ID, chromosome number, position)
<br> 5) **linkdis_genotype.csv**: SNP encodings (0, 1, 2) of SNPs, csv format (rows: N, columns: S)
<br> 6) **eqtl_subjects.csv**: Subject names, csv format (rows: S, columns: 1)
<br> 7) **eqtl_snps.csv**: SNP names, csv format (rows: N, columns: 1)
<br> 8) **eqtl_genes.csv**: Transcript names, csv format (rows: R, columns: 1)
<br> 9) **eqtl_snp_gene.csv**: SNP to gene mapping, csv format (rows: N, columns: SNP ID, gene ID)

In [1]:
# LOAD HBTRC STUDY DATA
from pandas_plink import read_plink1_bin, read_plink
import pandas as pd
# read_plink1_bin returns an xarray.DataArray of shape (samples, variants)
## with the .bim and .fam information attached as coordinates
G =  read_plink1_bin("HBTRC data.tsv/Genotypes/AMP-AD_HBTRC_MSSM_IlluminaHumanHap650Y.bed",
                     "HBTRC data.tsv/Genotypes/AMP-AD_HBTRC_MSSM_IlluminaHumanHap650Y.bim",
                     "HBTRC data.tsv/Genotypes/AMP-AD_HBTRC_MSSM_IlluminaHumanHap650Y.fam", 
                     verbose=False)
                     
    
# read_plink returns (pandasDataFrame, pandasDataFrame, daskArray)
# vBim_pNC: allele, vFam_pSC: samples, vBed_aNS: genotype (snp)
(vBim_pNC, vFam_pSC, vBed_aNS) = read_plink("HBTRC data.tsv/Genotypes/AMP-AD_HBTRC_MSSM_IlluminaHumanHap650Y")
print( "Genotypes loaded" )

Mapping files: 100%|██████████| 3/3 [00:01<00:00,  1.48it/s]

Genotypes loaded





In [4]:
"""
Columns:

Chrom: Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name
SNP: Variant identifier
Cm: Position in morgans or centimorgans (safe to use dummy value of '0')
Pos: Base-pair coordinate (1-based; limited to 2^31-2)
A0: Allele 1 (corresponding to clear bits in .bed; usually minor)
A1: Allele 2 (corresponding to set bits in .bed; usually major)
"""
vBim_pNC.head(10)

Unnamed: 0,chrom,snp,cm,pos,a0,a1,i
0,1,rs3094315,0.0,752566,C,T,0
1,1,rs12562034,0.0,768448,A,G,1
2,1,rs3934834,0.0,1005806,T,C,2
3,1,rs9442372,0.0,1018704,A,G,3
4,1,rs3737728,0.0,1021415,T,C,4
5,1,rs9442398,0.0,1021695,A,G,5
6,1,rs6687776,0.0,1030565,T,C,6
7,1,rs4970405,0.0,1048955,G,A,7
8,1,rs12726255,0.0,1049950,G,A,8
9,1,rs11807848,0.0,1061166,C,T,9


In [5]:
vFam_pSC.head(10)

Unnamed: 0,fid,iid,father,mother,gender,trait,i
0,15876,15876,0,0,0,-9.0,0
1,15877,15877,0,0,0,-9.0,1
2,15878,15878,0,0,0,-9.0,2
3,15879,15879,0,0,0,-9.0,3
4,15880,15880,0,0,0,-9.0,4
5,15881,15881,0,0,0,-9.0,5
6,15882,15882,0,0,0,-9.0,6
7,15883,15883,0,0,0,-9.0,7
8,15884,15884,0,0,0,-9.0,8
9,15885,15885,0,0,0,-9.0,9


In [6]:
# Make fid as index column
vFam_pSC.index = vFam_pSC.loc[ :, "fid" ]
vFam_pSC.tail()

fid (index),fid,iid,father,mother,gender,trait,i
16037,16037,16037,0,0,0,-9.0,735
16255,16255,16255,0,0,0,-9.0,736
16240,16240,16240,0,0,0,-9.0,737
16056,16056,16056,0,0,0,-9.0,738
21820,21820,21820,0,0,0,-9.0,739


In [7]:
# The values of the bed matrix denote how many alleles a1 are in the corresponding position and individual
vBed_nSN = vBed_aNS.compute().transpose()
vBed_nSN.shape

(740, 672838)

In [8]:
vBed_nSN

array([[2., 2., 1., ..., 1., 2., 2.],
       [2., 2., 2., ..., 1., 2., 2.],
       [2., 2., 2., ..., 2., 2., 2.],
       ...,
       [2., 2., 1., ..., 2., 2., 2.],
       [1., 2., 2., ..., 2., 2., 2.],
       [2., 2., 2., ..., 2., 2., 2.]])

In [9]:
# Dimension checks
assert( vBim_pNC.shape[0] == vBed_nSN.shape[1] )
assert( vFam_pSC.shape[0] == vBed_nSN.shape[0] )

sSubjectsSnps_nN = vFam_pSC['fid'].to_numpy()
sSubjectsSnps_nN.shape

(740,)

## Match snps with corresponding genes
### For large datasets

In [10]:
# LOAD ELINK SNP_GENE FILE FROM NCBI 
import numpy as np
sSNP_Gene = np.loadtxt(fname="/Elink/snp_genes")

In [11]:
# Take snps and genes in separate numpy arrays
sSNP_nN = sSNP_Gene[:,0]
sGene_nN = sSNP_Gene[:,1]

sSNP_nN = sSNP_nN.astype('int64')
sGene_nN = sGene_nN.astype('int64')

# Sort snp names to make search faster
perm = np.argsort(sSNP_nN)
sSNP_nN = sSNP_nN[perm]
sGene_nN = sGene_nN[perm]

# SNP IDs given in the vBim_pNC['snp'] column
nNn = np.array( [ ( int(vBim_pNC['snp'][i].replace('rs', ''))) 
                 for i in range(len(vBim_pNC))]) 
nNs = np.array( [ vBim_pNC['snp'][i] 
                 for i in range(len(vBim_pNC))])

# FIND IN WHICH GENES GIVEN SNPS ARE
nGeneID_0_nG = sGene_nN[np.searchsorted(sSNP_nN, nNn)]
sGeneID_0_nG = nGeneID_0_nG.astype('str')

assert(sGeneID_0_nG.shape[0] == vBim_pNC.shape[0])
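
`np.searchsorted` returns insertion positions rather than exact matches, so a SNP that is absent from the mapping file would be assigned the gene of its neighbor, and a SNP ID larger than every entry would index past the end of the array. A minimal sketch of a lookup that checks for exact matches, on synthetic IDs (the function name `lookup_genes` and the `-1` placeholder for unmatched SNPs are assumptions, not part of the notebook):

```python
import numpy as np

def lookup_genes(sorted_snp_ids, gene_ids, query_snp_ids, missing=-1):
    """Map each query SNP ID to its gene ID, or `missing` if absent."""
    idx = np.searchsorted(sorted_snp_ids, query_snp_ids)
    # searchsorted can return len(array) for queries above the maximum
    idx = np.clip(idx, 0, len(sorted_snp_ids) - 1)
    found = sorted_snp_ids[idx] == query_snp_ids
    return np.where(found, gene_ids[idx], missing)

snps = np.array([10, 20, 30])        # sorted rs numbers
genes = np.array([111, 222, 333])    # gene ID for each SNP
result = lookup_genes(snps, genes, np.array([20, 25, 99, 10]))
print(result)
```

Here the queries 25 and 99 are not in the mapping and come back as `-1` instead of borrowing a neighboring gene.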

## Import RNA expression data
## Find if genes are TFs or targets

In [18]:
# Import rna expression data
import pandas as pd
import numpy as np
sRna_pRC = pd.read_csv("HBTRC data.tsv/Gene expression/AMP-AD_HBTRC_MSSM_Agilent44Karray_PFC_AgeCorrected_all.tsv", 
                       sep='\t', 
                       dtype = str,  # np.str was removed in NumPy 1.24; the builtin str is equivalent
                       na_values = str,
                       keep_default_na = False)
print("Gene expressions loaded.")
sRna_pRC.shape

Gene expressions loaded.


(39579, 473)

In [16]:
sGeneSymbols_nR = sRna_pRC['gene_symbol'].to_numpy()
sRna_pRC[40:50]

Unnamed: 0,reporterid,sequence,gene_symbol,clusterName,LocusLinkID,Chromosome,StartCoordinate,EndCoordinate,expr_Mean,expr_STD,...,16274,16273,16272,16271,16270,16269,16268,16267,16266,16265
40,10020272110,RSE_00000246951,RSE_00000246951,HSG00287442,,14,19767400,19767646,0.0009096187595852,0.028563033946755,...,0.019483,0.000138,0.011418,0.014406,-0.010631,-0.053129,-0.075973,-0.039212,-0.043582,0.009588
41,10023804644,Contig48774_RC,Contig48774_RC,HSG00405848,,19,47284507,47284972,-0.0008113851874822,0.0753989871525773,...,-0.052873,-0.020108,0.027628,0.022246,-0.030626,-0.057649,0.097587,-0.005713,0.025973,0.035557
42,10023804647,NM_002821,PTK7,HSG00243177,5754.0,6,43152006,43237435,-0.001299907949393,0.0901277811699355,...,0.105776,-0.062848,0.057068,-0.058405,0.041235,-0.000107,0.121285,0.003693,-0.011672,0.008062
43,10023804649,NM_005052,RAC3,HSG00302751,5881.0,17,77582820,77585368,0.0052491233503869,0.0768838865501222,...,-0.015046,-0.050878,-0.093557,-0.041936,0.073034,0.019985,0.006396,-0.003026,0.00948,0.120577
44,10023804650,NM_013302,EEF2K,HSG00294779,29904.0,16,22125256,22205452,-0.0008402626246038,0.071208396136064,...,0.054709,-0.016822,0.033285,0.002015,-0.045867,-0.080366,0.060596,0.041399,-0.018969,0.023791
45,10023804651,NM_001958,EEF1A2,HSG00312077,1917.0,20,61589809,61600949,-0.0051833846542027,0.0818523427224567,...,-0.007156,0.010638,0.043828,0.062611,0.086822,-0.004916,0.003247,-0.004339,0.031675,-0.09015
46,10023804653,NM_005309,GPT,HSG00258201,2875.0,8,145700232,145703357,0.0009668769047257,0.0397133413562591,...,-0.009082,-0.003513,0.032041,-0.001855,0.001469,-0.005628,-0.022021,-0.037238,0.034346,0.005751
47,10023804655,NM_019848,SLC10A3,HSG00254520,8273.0,X,153279351,153282699,0.0011106429006194,0.0387127625853338,...,-0.020499,-0.002079,0.021393,-0.026294,0.006283,-0.02988,0.018761,-0.013479,0.010437,0.018053
48,10023804656,NM_001513,GSTZ1,HSG00288811,2954.0,14,76857458,76867692,-5.82494503961694e-05,0.101248961269102,...,0.058883,0.123517,-0.055339,-0.077121,0.105678,0.093959,0.052151,0.050042,-0.110663,-0.057596
49,10023804657,AL110240,RP11-529I10.4,HSG00269139,25911.0,10,103344414,103359398,-0.0006099050340211,0.0559833434153445,...,-0.003699,0.093865,0.040565,0.069427,0.016685,0.030223,-0.034967,0.002586,0.043264,0.024565


### Load transcription factors and target genes

In [13]:
import os

# Set up environment variables for rpy2 to work
os.environ['R_HOME'] = '/usr/local/Cellar/r/3.6.0_2/lib/R'
os.environ['R_USER'] = '/usr/local/lib/python3.7/site-packages/rpy2'

In [14]:
# Load Rdata with rpy2
import rpy2
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
# Clear the workspace
r( 'rm( list = ls( ) )' )
# 16GB memory limit
r( 'memory.limit( 16000 )' )
r['load']('gwasPreprocessing_v2_tfs.Rdata')
print( 'Data loaded from Rdata file.' )

# Tell the user which variables were loaded
print( 'Variables loaded:' )
for a in r['ls']():
    print( a )

Data loaded from Rdata file.
Variables loaded:
targets_iTiC
tfs_iFiC


In [17]:
##tfs_iFiC
sTFs_nPC = np.array( rpy2.robjects.r( 't( tfs_iFiC )' ),
                    dtype = str )
sTFs_nPC.shape = r( 'dim( tfs_iFiC )' )

## targets_iTiC
sTargets_nPC = np.array( rpy2.robjects.r( 't( targets_iTiC )' ),
                    dtype = str )
sTargets_nPC.shape = r( 'dim( targets_iTiC )' )

## targets_iTiC[, "tf_symbol"]
sTFs_nP = np.array( r( ' targets_iTiC[, "tf_symbol"] ' ) )

## targets_iTiC[, "target_symbol"]
sTargets_nP = np.array( r( ' targets_iTiC[, "target_symbol"] ' ) )

## Find TFs and targets given in sRna_pRC['gene_symbol']
nTFs_nR = np.array([sGeneSymbols_nR[i] in sTFs_nP for i in range(sGeneSymbols_nR.shape[0])])
sTFs_nT = sGeneSymbols_nR[np.where(nTFs_nR == True)[0]]

nTargets_nR = np.array([sGeneSymbols_nR[i] in sTargets_nP for i in range(sGeneSymbols_nR.shape[0])])
sTargets_nT = sGeneSymbols_nR[np.where(nTargets_nR == True)[0]]
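
The membership tests above can also be written with `np.isin`, which builds the boolean mask in one vectorized call. A small sketch on synthetic data (the gene symbols are taken from the expression table shown earlier, but which of them count as TFs here is made up for illustration):

```python
import numpy as np

gene_symbols = np.array(["PTK7", "RAC3", "EEF2K", "GPT"])
tf_symbols = np.array(["RAC3", "GPT", "RAC3"])   # hypothetical TF list

is_tf = np.isin(gene_symbols, tf_symbols)   # boolean mask, one entry per gene
tfs_in_data = gene_symbols[is_tf]
print(tfs_in_data)
```

The mask keeps the order of `gene_symbols`, and repeated entries in the TF list do not affect the result.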

In [21]:
%run loadData_utils.py
## Create pairs of transcription factors and targets
nPairs_lP = []
for tf in sTFs_nT:
    nTFtargetList = tf_Pairs_Finding(tf, 
                                     sTFs_nP, 
                                     sTargets_nP, 
                                     sGeneSymbols_nR)
    nPairs_lP.extend(nTFtargetList)

nPairs_nP = np.array(nPairs_lP)
nPairsTT_nP = sTargets_nPC[nPairs_nP]
nPairs_lPlC = [nPairsTT_nP[:,i] for i in range(5) ]

## Find unique genes that are TFs or targets in order to filter RNA expression data
sGenes_lG = []
sTFs_lT = np.unique(nPairs_lPlC[1]).tolist()
sTargets_lT = np.unique(nPairs_lPlC[3]).tolist()
sGenes_lG.extend(sTFs_lT)
sGenes_lG.extend(sTargets_lT)
uGenes_lG = np.unique(sGenes_lG).tolist()
print(len(uGenes_lG))

611
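
The source of `tf_Pairs_Finding` in `loadData_utils.py` is not shown in this notebook. Judging only from how its result is used (the returned values index rows of `sTargets_nPC`), one plausible reading is that it returns the row indices of pairs whose TF matches and whose target is present in the expression data. The sketch below is an assumption about that behavior, not the actual implementation, and `find_tf_pairs` is a hypothetical name:

```python
import numpy as np

def find_tf_pairs(tf, tf_symbols, target_symbols, gene_symbols):
    """Row indices of pairs whose TF is `tf` and whose target is measured."""
    is_tf_row = tf_symbols == tf
    target_measured = np.isin(target_symbols, gene_symbols)
    return np.where(is_tf_row & target_measured)[0].tolist()

tf_col = np.array(["A", "A", "B", "A"])       # TF symbol of each known pair
target_col = np.array(["X", "Y", "X", "Z"])   # target symbol of each pair
measured = np.array(["A", "B", "X", "Z"])     # symbols present in the data
pairs = find_tf_pairs("A", tf_col, target_col, measured)
print(pairs)
```

Row 1 is skipped because its target "Y" is not among the measured genes.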


In [22]:
# Check which subjects overlap in rna expression data and snps data
cols_excluded = ["reporterid", "sequence", "gene_symbol", "clusterName", "LocusLinkID", "Chromosome", "StartCoordinate", "EndCoordinate", "expr_Mean", "expr_STD"]
sRna_0_pRS = sRna_pRC.drop(columns = cols_excluded)
sSubjectsRna_nR = sRna_0_pRS.columns.to_numpy()

nSubjectRnaIndexinSnps_nR = np.array( [np.where(sSubjectsRna_nR[i] == sSubjectsSnps_nN )[0][0]
                                       if len(np.where(sSubjectsRna_nR[i] == sSubjectsSnps_nN )[0]) != 0 else -1
                                       for i in range(sSubjectsRna_nR.shape[0]) ] ) 

sSubjectsRna_nR = np.delete(sSubjectsRna_nR, np.where(nSubjectRnaIndexinSnps_nR == -1)[0])

nSubjectRnaIndexinSnps_lR = np.delete(nSubjectRnaIndexinSnps_nR, np.where(nSubjectRnaIndexinSnps_nR == -1)[0]).tolist()

print("Overlapped number of subjects are: ",  sSubjectsRna_nR.shape[0])


# Select only overlapped subjects for vBed_nSN (genotypes), vFam_pSC (samples) and sRna_0_pRS (rna expression)
# Indexing with a list returns a copy, so vBed_nSN itself is not modified
rSnp_t_nSN = vBed_nSN[nSubjectRnaIndexinSnps_lR]
# Recode allele counts 0/1/2 as categories 1/2/3 and map missing calls (NaN) to 3
rSnp_nSN = np.where(np.isnan(rSnp_t_nSN), 3, rSnp_t_nSN + 1)
genotype_categories = np.unique(rSnp_nSN)
assert(len(genotype_categories) == 3)

vFam_pSC = vFam_pSC.loc[sSubjectsRna_nR.tolist()]
sRna_1_pRS = sRna_0_pRS[sSubjectsRna_nR.tolist()]
nSs = sSubjectsRna_nR # subject names

assert( vFam_pSC.shape[0] == rSnp_nSN.shape[0] )
assert( vFam_pSC.shape[0] == sRna_1_pRS.shape[1] )
assert( rSnp_nSN.shape[0] == sRna_1_pRS.shape[1] )

sRna_2_pRS = sRna_1_pRS.set_index(sRna_pRC['LocusLinkID'])
sRna_3_pRS = sRna_2_pRS.loc[uGenes_lG]
print("Process done.")

Overlapped number of subjects are:  434
Process done.


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
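
In pandas 1.0 and later, the selection that triggers this warning raises a `KeyError` instead. The suggested `.reindex()` is not a drop-in replacement here, because it rejects an index with duplicate labels, and several probes can share one LocusLinkID. A minimal sketch of one alternative, on synthetic data, that selects only the labels that are present and reports the rest. Unlike the original call, missing genes are skipped rather than returned as NaN rows, so code that expects one group per requested gene would need adjusting:

```python
import pandas as pd

expr = pd.DataFrame({"s1": [0.1, 0.3, 0.5], "s2": [0.2, 0.4, 0.6]},
                    index=["5754", "5754", "5881"])  # duplicate probe labels
wanted = ["5881", "9999", "5754"]                    # "9999" is not measured

present = [g for g in wanted if g in expr.index]
missing = [g for g in wanted if g not in expr.index]
subset = expr.loc[present]    # keeps every probe of each present gene
print(subset.index.tolist(), missing)
```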


### Create rRna_nSR, nPairs_lPlC

In [23]:
# Create transcripts
nRs = np.array( uGenes_lG )

# Find duplicate RNA expressions of each gene
sRna_lRlC = sRna_3_pRS.index.tolist()
duplicateGenes = 1

nIndex_lR = [duplicateGenes]

for iG in range(len(sRna_lRlC)-1):
    if sRna_lRlC[iG + 1 ] != sRna_lRlC[iG]:
        duplicateGenes += 1
    nIndex_lR.append(duplicateGenes)
    
nIndex_nR = np.array(nIndex_lR)

# Create a new pandas dataframe for average of RNA expression if there are multiple values of each gene
rRna_pRS = pd.DataFrame(columns=sSubjectsRna_nR)

for iR in range(len(nRs)):
    index = np.where(nIndex_nR == (iR + 1))[0]
    gene = nRs[iR]    
    sRna_4_pRS = sRna_3_pRS.iloc[index]
    sRna_4_pRS = sRna_4_pRS.replace("NA", np.nan)
    
    sRna_4_pRS = sRna_4_pRS.astype('float')
    rRna_pRS.loc[gene] = sRna_4_pRS.mean(axis = 0)


rRna_pRS = rRna_pRS.dropna(how='all')
rRna_nSR = rRna_pRS.transpose().to_numpy()
assert( rRna_nSR.shape[0] == rSnp_nSN.shape[0])
print("rRna_nSR is created.")

# Find and drop NA rows
nNA_nIndex = np.unique(np.where(sRna_3_pRS.isna())[0])
sNA_nGenes = sRna_3_pRS.iloc[nNA_nIndex].index.to_numpy()
nNA_lIndex = [np.where(sNA_nGenes[i] == nRs)[0][0] for i in range(len(sNA_nGenes))]
nRs = np.delete(nRs, nNA_lIndex)

# Create transcripts' corresponding SNP pairs
nNs_lRlN = [nNs[np.where(iR == sGeneID_0_nG)] for iR in nRs]
assert( len(nRs) == len(nNs_lRlN) )

print("Genes with missing values are dropped.")
assert( rRna_nSR.shape[1] == len(nRs) )

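# Replace remaining NA values with the average of the corresponding gene
# (rRna_nSR is S x R, so genes are columns and the mean is taken over axis 0)
gene_means = np.nanmean(rRna_nSR, axis = 0)
na_indices = np.where(np.isnan(rRna_nSR))
rRna_nSR[na_indices] = np.take(gene_means, na_indices[1])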
assert (len(np.where(np.isnan(rRna_nSR))[0]) == 0)
assert (len(np.where(np.isnan(rRna_nSR))[1]) == 0)
print("NA values are replaced with average of rows.")

# Update nPairs_lPlC with removed genes
nNA_Index_Pairs = np.array( [np.where(sNA_nGenes[i] == nPairs_lPlC[3])[0][0] for i in range(len(sNA_nGenes))] )
nPairs_lPlC = [np.delete(nPairs_lPlC[iC], nNA_Index_Pairs) for iC in range(len(nPairs_lPlC)) ] 
print("nPairs_lPlC is updated.")

rRna_nSR is created.
Genes with missing values are dropped.
NA values are replaced with average of rows.
nPairs_lPlC is updated.
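
Step 5 of the overview replaces missing expression values with the average of the gene they belong to. For an array laid out as (S x R), that is a column-wise mean. A small sketch on synthetic data (the array and variable names are illustrative only):

```python
import numpy as np

expr_SR = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])        # 3 subjects x 2 genes

gene_means = np.nanmean(expr_SR, axis=0)   # one mean per gene (column)
rows, cols = np.where(np.isnan(expr_SR))
expr_SR[rows, cols] = gene_means[cols]     # fill each gap with its gene's mean
print(expr_SR)
```

Taking the mean over `axis=1` instead would average across all genes of one subject, which is a different imputation.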


In [218]:
# CREATE INPUTS FOR EQTL ANALYSIS
link_genotype = np.zeros(shape = rSnp_nSN.shape)
link_genotype[rSnp_nSN == 1] = 0
link_genotype[rSnp_nSN == 2] = 1
link_genotype[rSnp_nSN == 3] = 2

# GENOTYPE
eqtl_genotype = pd.DataFrame(data = rSnp_nSN.transpose(), columns=nSs)
eqtl_genotype.set_index(nNs, inplace=True)
eqtl_genotype.index.rename("snp", inplace=True)

# GENE EXPRESSION
eqtl_expression = pd.DataFrame(data = rRna_nSR.transpose(), columns = nSs)
eqtl_expression.set_index(nRs, inplace=True)
eqtl_expression.index.rename("id", inplace=True)

# LINKAGE DISEQUILIBRIUM GENOTYPE
linkdis_genotype = pd.DataFrame(data = link_genotype.transpose(), columns=nSs)
linkdis_genotype.set_index(nNs, inplace=True)
linkdis_genotype.index.rename("snp", inplace=True)

# PROBE_LOC: GENE LOCATIONS
eqtl_probe_lob_cols = ['LocusLinkID', 'Chromosome', 'StartCoordinate', 'EndCoordinate']
eqtl_probe_lob_0 = sRna_pRC[eqtl_probe_lob_cols]
eqtl_probe_lob_1 = eqtl_probe_lob_0.set_index(eqtl_probe_lob_0['LocusLinkID'])
eqtl_probe_lob_1 = eqtl_probe_lob_1.loc[nRs]
eqtl_probe_lob_1.drop_duplicates(subset='LocusLinkID', keep='first', inplace=True)
eqtl_probe_lob_1.drop(columns=['LocusLinkID'], inplace=True)
eqtl_probe_lob_1.index.rename('id', inplace=True)

# SNP_LOC: SNP POSITIONS
snp_loc_data = {'snp': vBim_pNC['snp'],
                'chrom_snp': vBim_pNC['chrom'],
                'pos': vBim_pNC['pos'] }

eqtl_snp_loc = pd.DataFrame(data = snp_loc_data)
eqtl_snp_loc.set_index(nNs, inplace=True)
eqtl_snp_loc.drop(columns=['snp'], inplace=True)
eqtl_snp_loc.index.rename("snp", inplace=True)

# MAPPING from SNP to GENE
snp_gene_data = {'snp': nNs, 'gene': sGeneID_0_nG}
snp_gene = pd.DataFrame(data = snp_gene_data)

assert( eqtl_genotype.shape[1] == eqtl_expression.shape[1] )
assert( eqtl_genotype.shape[0] == eqtl_snp_loc.shape[0] )
assert( eqtl_expression.shape[0] == eqtl_probe_lob_1.shape[0] )
assert( eqtl_expression.index.name == eqtl_probe_lob_1.index.name)
assert( eqtl_genotype.index.name == eqtl_snp_loc.index.name)

print("eQTL inputs in pandas format created.")

# eQTL: STORE IN .CSV FORMAT
expression_csv = eqtl_expression.to_csv("../eQTL + LD analysis/HBTRC/eqtl_expression.csv",
                                        sep=",",
                                        header=True)

genotype_csv = eqtl_genotype.to_csv("../eQTL + LD analysis/HBTRC/eqtl_genotype.csv",
                                        sep=",",
                                        header=True)

probe_loc_csv = eqtl_probe_lob_1.to_csv("../eQTL + LD analysis/HBTRC/eqtl_probe_loc.csv",
                                        sep=",",
                                        header=True)

snp_loc_csv = eqtl_snp_loc.to_csv("../eQTL + LD analysis/HBTRC/eqtl_snp_loc.csv",
                                        sep=",",
                                        header=True)

print("eQTL inputs stored in CSV format.")

# LINKAGE DISEQUILIBRIUM: STORE IN .CSV FORMAT
linkdis_genotype_csv = linkdis_genotype.to_csv("../eQTL + LD analysis/HBTRC/linkdis_genotype.csv",
                                        sep=",",
                                        header=True)

# HELPER VARIABLES
subjects_csv = pd.DataFrame(data=nSs).to_csv("../eQTL + LD analysis/HBTRC/eqtl_subjects.csv",
                                   sep =",",
                                   header=True)

snps_csv = pd.DataFrame(data=nNs).to_csv("../eQTL + LD analysis/HBTRC/eqtl_snps.csv",
                                   sep =",",
                                   header=True)

genes_csv = pd.DataFrame(data=nRs).to_csv("../eQTL + LD analysis/HBTRC/eqtl_genes.csv",
                                   sep =",",
                                   header=True)

snp_gene_csv = pd.DataFrame(data=snp_gene).to_csv("../eQTL + LD analysis/HBTRC/eqtl_snp_gene.csv",
                                   sep =",",
                                   header=True)

print("eQTL input helpers stored in CSV format.")

eQTL inputs in pandas format created.
eQTL inputs stored in CSV format.
eQTL input helpers stored in CSV format.
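
The genotype files written above share one layout: SNPs as rows, subjects as columns, and an index column named `snp`. A minimal sketch of that layout on a tiny example, including the shift from category codes (1, 2, 3) to allele counts (0, 1, 2) used for the LD file. The subject and SNP names are taken from the tables shown earlier, while the genotype codes are synthetic:

```python
import numpy as np
import pandas as pd

subjects = ["15876", "15877"]                      # column labels (S)
snps = ["rs3094315", "rs12562034", "rs3934834"]    # row labels (N)
codes_SN = np.array([[3, 3, 2],
                     [1, 3, 3]])                   # categories 1/2/3, S x N

geno = pd.DataFrame(codes_SN.T,
                    index=pd.Index(snps, name="snp"),
                    columns=subjects)              # N x S, as in eqtl_genotype.csv
ld = geno - 1                                      # allele counts 0/1/2
csv_text = geno.to_csv()                           # no path given: returns a string
print(csv_text)
```

Naming the index up front means the header row of the CSV starts with `snp`, which is what the separate `index.rename` calls achieve in the cell above.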


In [71]:
# SAVE DATA
# Save all the transformed data into a pickle, such that it can be more easily loaded from Python

# Save data into Python file
import pickle
with open( 'loadData_HBTRC.pickle', 'wb' ) as f:
    pickle.dump( rRna_nSR, f, pickle.HIGHEST_PROTOCOL )
    pickle.dump( rSnp_nSN, f, pickle.HIGHEST_PROTOCOL )
    pickle.dump( nNs_lRlN, f, pickle.HIGHEST_PROTOCOL )
    pickle.dump( nPairs_lPlC, f, pickle.HIGHEST_PROTOCOL )
    pickle.dump( nRs, f, pickle.HIGHEST_PROTOCOL )
    pickle.dump( nNs, f, pickle.HIGHEST_PROTOCOL )
    pickle.dump( nSs, f, pickle.HIGHEST_PROTOCOL )
    print( 'Data saved into pickle.' )

Data saved into pickle.
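
Because the objects are written with seven consecutive `pickle.dump` calls, a reader has to call `pickle.load` seven times in the same order. A minimal sketch of that pattern, using an in-memory buffer and placeholder objects in place of the real file and arrays:

```python
import io
import pickle

# Order in which the notebook dumps its variables
names = ["rRna_nSR", "rSnp_nSN", "nNs_lRlN", "nPairs_lPlC", "nRs", "nNs", "nSs"]

# Stand-in for the real file: seven placeholder objects dumped in order
buffer = io.BytesIO()
for i, _ in enumerate(names):
    pickle.dump({"placeholder": i}, buffer, pickle.HIGHEST_PROTOCOL)
buffer.seek(0)

# Each pickle.load call returns the next object in the stream
loaded = {name: pickle.load(buffer) for name in names}
print(loaded["nRs"])
```

With the real data, `buffer` would be replaced by `open('loadData_HBTRC.pickle', 'rb')`.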
