# Upload data and visualize
I need to grab the beta value and the standard error of the beta for each SNP across each cohort.

**Don't forget:** I also need to grab the minor allele and major allele of each SNP in each cohort so that I can determine the allele orientation. Then I will use a "majority rules" approach to decide whether to flip the sign of a beta. Specifically, if a few cohorts list the major and minor alleles the other way around from the majority, I will flip the sign of the beta for those cohorts (i.e., change it from negative to positive or vice versa).
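The "majority rules" step can be sketched in plain Python. This is only an illustration: the function name `harmonize_betas`, the record keys (`cohort`, `minor`, `major`, `beta`), and the allele and beta values are all made up, and strand flips (e.g. A/G reported as T/C) are not handled.

```python
from collections import Counter

def harmonize_betas(records):
    """Return {cohort: beta} with betas aligned to the majority allele orientation."""
    # The majority orientation is the most common (minor, major) pair.
    # Ties go to the pair that appears first in the input.
    counts = Counter((r["minor"], r["major"]) for r in records)
    ref_minor, ref_major = counts.most_common(1)[0][0]

    aligned = {}
    for r in records:
        pair = (r["minor"], r["major"])
        if pair == (ref_minor, ref_major):
            aligned[r["cohort"]] = r["beta"]    # same orientation: keep the sign
        elif pair == (ref_major, ref_minor):
            aligned[r["cohort"]] = -r["beta"]   # alleles swapped: flip the sign
        else:
            aligned[r["cohort"]] = None         # different alleles: needs manual review
    return aligned

# Made-up example values, not real results.
example_records = [
    {"cohort": "COGEND_AA", "minor": "A", "major": "G", "beta": 0.12},
    {"cohort": "COGEND_EA", "minor": "A", "major": "G", "beta": 0.08},
    {"cohort": "GAIN_EA", "minor": "G", "major": "A", "beta": -0.10},
]
aligned_betas = harmonize_betas(example_records)
print(aligned_betas)
```

Here only `GAIN_EA` lists the alleles in the opposite order, so only its beta changes sign.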

## Sample data
This sample data is for one variant, `rs1008078`, across all the cohorts in meta-analysis 044. I will store the results in a dictionary: the key will be the rsID and the value will be a pandas DataFrame.
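Before working with the spreadsheet, here is a minimal sketch of the per-variant calculation, assuming a fixed-effects, inverse-variance weighted meta-analysis (each cohort weighted by 1/SE²). The function name `inverse_variance_meta` and the beta and standard error values are made up for illustration. The p-value uses the identity that the upper tail of a chi-square with 1 degree of freedom equals `erfc(sqrt(x / 2))`, which gives the same number as `scipy.stats.chi2.sf(x, 1)`.

```python
import math

def inverse_variance_meta(betas, std_errors):
    """Fixed-effects meta-analysis of one variant across cohorts."""
    weights = [1.0 / se ** 2 for se in std_errors]   # per-cohort weight, 1 / SE^2
    sum_weights = sum(weights)
    meta_beta = sum(b * w for b, w in zip(betas, weights)) / sum_weights
    meta_se = math.sqrt(1.0 / sum_weights)
    z = meta_beta / meta_se
    chi_sq = z ** 2
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))     # chi-square upper tail, 1 df
    return {"beta": meta_beta, "se": meta_se, "z": z, "chi_sq": chi_sq, "p": p_value}

# Results keyed by rsID; the two betas and standard errors are made-up values.
meta_results = {}
meta_results["rs1008078"] = inverse_variance_meta([0.10, 0.20], [0.05, 0.10])
print(meta_results["rs1008078"])
```

The cohort with the smaller standard error gets four times the weight of the other, so the combined beta lands closer to 0.10 than to 0.20.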

In [349]:
import os
import pandas as pd
import numpy as np
import math
from decimal import Decimal
from scipy.stats import chi2

# I will create a new data frame for each variant. This data frame will have the column names:
# (1) Cohort, (2) Ancestry group, (3) Beta, (4) Std. Error, (5) Seweighted, and (6) Pr(>|t|), which is the p-value.
# Additionally, I will add the rows for calculating the meta in the Seweighted column, below all of the cohorts.
# Getting the first cohort; note that this will eventually be in a loop over the cohorts.

os.chdir("C:\\Users\\jmarks\\Desktop\\Projects\\Nicotine\\GSCAN_extended_results_nicotine\\results\\results_from_missing_snp_lookup\\")
mydata = pd.ExcelFile("sample_missing_SNP_results.xlsx")
mydata = mydata.parse("Sheet1")

cohorts_list44 = ["AAND_COGEND2_AA",
"COGEND_AA",
"COGEND_EA",
"COGEND2_AA",
"COGEND2_EA",
"COPDGene_AA",
"COPDGene_EA",
"deCODE_EA",
"Dental_Caries_EA",
"EAGLE_EA",
"FINN_TWIN_EA",
"GAIN_AA",
"GAIN_EA",
"JHS_AA",
"nonGAIN_EA",
"NTR_EA",
"SAGE_AA",
"SAGE_EA",
"UW_TTURC_AA",
"UW_TTURC_EA",
"YALE_PENN_AA",
"YALE_PENN_EA"]

totalRows = len(cohorts_list44)

# This dictionary will have an rsID for the key and a dataframe for the value
dataDict = {}

# initialize a dataframe
emptyArray = np.empty((totalRows,13,))
emptyArray[:] = np.nan
columns = ["SNP", "Cohort", "Ancestry group", "Beta", "Std. Error", "Seweighted", "Pr(>|t|)", 
           "AllMeta.SumSEweight", "AllMeta.weightedSE", "AllMeta.SEweighted_beta", 
           "AllMeta.SEweighted_Z", "AllMeta.SEweighted_Chi", "AllMeta.SEweighted_P"]
num_of_rsIDs = len(mydata)

# loop to fill in the information for the meta-analysis calculation
for rsID in range(num_of_rsIDs):

    markerName = mydata.iloc[rsID, 0]

    # start from a copy of the NaN template; the text columns are cast to object
    # so that strings can be stored in them
    snpData = pd.DataFrame(emptyArray.copy(), columns=columns).astype(
        {"SNP": object, "Cohort": object, "Ancestry group": object,
         "AllMeta.SEweighted_P": object})
    snpData["SNP"] = markerName  # add SNP to every cohort row
    dataDict[markerName] = snpData

    # running sum of beta * weight for this SNP (numerator of the meta beta)
    weightedBetaSum = 0

    for cohort in range(len(cohorts_list44)):
        cohortName = cohorts_list44[cohort]

        # get all of the cohort-specific data for this SNP. The lookbehind keeps
        # "COGEND2_AA" from also matching "AAND_COGEND2_AA", and "GAIN_EA" from
        # also matching "nonGAIN_EA".
        cohortData = mydata.filter(regex=r"(?<!\w)" + cohortName).iloc[rsID, :]

        # add cohort to dataframe
        snpData.iloc[cohort, 1] = cohortName

        # add Ancestry group (the last two characters of the cohort name)
        ancestry = cohortName[-2:]
        snpData.iloc[cohort, 2] = ancestry

        # add Beta
        betaVal = cohortData.filter(like=".beta").iloc[0]
        snpData.iloc[cohort, 3] = betaVal

        # add Std. Error
        standardErr = cohortData.filter(like="sebeta").iloc[0]
        snpData.iloc[cohort, 4] = standardErr

        # add Seweighted (inverse-variance weight)
        seWeighted = 1 / (standardErr ** 2)
        snpData.iloc[cohort, 5] = seWeighted

        # add p-val (column name ending in ".p")
        pVal = cohortData.filter(regex=r"\.p$").iloc[0]
        snpData.iloc[cohort, 6] = pVal

        # add this cohort's contribution to the weighted sum of betas
        weightedBetaSum += betaVal * seWeighted

    # Meta calculations (inverse-variance weighted), stored on the first row
    SumSEweight = snpData["Seweighted"].sum()
    snpData.iloc[0, 7] = SumSEweight

    metaWeightedSE = math.sqrt(1 / SumSEweight)
    snpData.iloc[0, 8] = metaWeightedSE

    # meta beta = sum(beta * weight) / sum(weight)
    metaSEweighted_beta = weightedBetaSum / SumSEweight
    snpData.iloc[0, 9] = metaSEweighted_beta

    metaSEweighted_Z = metaSEweighted_beta / metaWeightedSE
    snpData.iloc[0, 10] = metaSEweighted_Z

    metaSEweighted_chi = metaSEweighted_Z ** 2
    snpData.iloc[0, 11] = metaSEweighted_chi

    # p-value from a chi-square distribution with 1 degree of freedom
    metaSEweighted_P = '%.2E' % Decimal(chi2.sf(metaSEweighted_chi, 1))
    snpData.iloc[0, 12] = metaSEweighted_P


dataDict[markerName]
dataDict[markerName].to_csv("C:\\Users\\jmarks\\Desktop\\out.file", sep='\t', index=False)

In [298]:
cohorts_list44[5]

'COPDGene_AA'

In [150]:
import numpy as np
a = np.empty((29,7,))
a[:] = np.nan
print(a)
tempDF = pd.DataFrame(np.nan, index=[0,1,2,3],columns=["a","b","c"])
#tempDF

[[ nan  nan  nan  nan  nan  nan  nan]
 [ nan  nan  nan  nan  nan  nan  nan]
 ...
 [ nan  nan  nan  nan  nan  nan  nan]]

In [112]:
dataDict['rs1008078']

Unnamed: 0,SNP,Cohort,Ancestry group,Beta,Std. Error,Seweighted,Pr(>|t|)


# 044 meta-analysis

In [51]:
import os
import pandas as pd



os.chdir("C:\\Users\\jmarks\\Desktop\\Projects\\Nicotine\\GSCAN_extended_results_nicotine\\results\\results_from_missing_snp_lookup\\")


#os.listdir()
# Load the workbook; the sheets are assumed to be ordered 044, 045, 046.
xl = pd.ExcelFile("missing_SNPs_results_prefiltered_meta_analyses_044_045_046_V02.xlsx")
xl.sheet_names
# Indexing sheet_names raises an IndexError if the workbook has fewer than three sheets.
zero44 = xl.parse(xl.sheet_names[0])
zero45 = xl.parse(xl.sheet_names[1])
zero46 = xl.parse(xl.sheet_names[2])
zero44.iloc[0:1,:]

# Each cohort has both a beta column and an sebeta column.

IndexError: list index out of range

# 045 meta-analysis

# 046 meta-analysis