# Upload data and visualize
I need to grab the beta value and its standard error for each SNP across each cohort.

**Don't forget:** I also need to grab the minor allele and major allele of each SNP in each cohort so that I can determine the orientation. Then I will use a "majority rules" approach to decide whether to flip the sign of the beta. Specifically, if a few cohorts list the major and minor alleles the other way around from the majority, I will flip the sign of the beta for those cohorts (i.e., change it from negative to positive or vice versa).
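
The "majority rules" orientation check could be sketched roughly as below. This is only an illustration under assumptions: the function name `harmonize_betas`, the cohort names, and the allele and beta values are all hypothetical, ties are resolved in favor of the orientation seen first, and strand flips (e.g. A/G reported as T/C) are not handled.

```python
from collections import Counter


def harmonize_betas(records):
    """Flip beta signs so every cohort uses the majority allele orientation.

    `records` maps a cohort name to a (minor_allele, major_allele, beta) tuple.
    Hypothetical helper, not part of the notebook's pipeline.
    """
    # Count how many cohorts report each (minor, major) orientation
    orientations = Counter((minor, major) for minor, major, _ in records.values())
    reference, _count = orientations.most_common(1)[0]
    swapped = (reference[1], reference[0])

    harmonized = {}
    for cohort, (minor, major, beta) in records.items():
        if (minor, major) == reference:
            harmonized[cohort] = beta
        elif (minor, major) == swapped:
            # Alleles are listed the other way around: flip the sign
            harmonized[cohort] = -beta
        else:
            raise ValueError(
                f"{cohort}: alleles {minor}/{major} do not match {reference}"
            )
    return reference, harmonized


# Made-up example values, for illustration only
example = {
    "COHORT_A": ("A", "G", -0.05),
    "COHORT_B": ("A", "G", -0.03),
    "COHORT_C": ("G", "A", 0.02),
}
reference, betas = harmonize_betas(example)
```

Here `COHORT_C` lists the alleles in the opposite order from the other two cohorts, so its beta is the only one whose sign changes.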

**Also:** test the code against `rs16969968`.

## Sample data
This sample data is for one variant - `rs1008078` - across all the cohorts in meta-analysis 044. I will store the results in a dictionary. The key will be the rsID and the value will be a pandas DataFrame.

In [130]:
import os
import pandas as pd
import numpy as np
import math
from decimal import Decimal
from scipy.stats import chi2
"""
This code takes as input (1) a list of cohorts specific to a meta-analysis
and (2) an Excel file containing the data on each variant that resulted
from a SNP look-up. You need to specify the name of the sheet as well. This file should contain a header. Each column heading should be
the cohort name & ancestry, followed by a period, followed by the data description. Specifically,
the data of interest for this script are the beta value, the standard error, and the
variant-specific p-value. An example of how these entries should appear in the Excel sheet is

Example:
AAND_COGEND2_AA.beta_SNP_add,  AAND_COGEND2_AA.sebeta_SNP_add, AAND_COGEND2_AA.p

For (1) an example of the input list is: 
AAND_COGEND2_AA, DECODE_EA, NONGAIN_EA 

The output will be an excel file with the meta-analysis calculations for each variant of interest.
"""
# I will create a new data frame for each variant. This data frame will have the column names:
# (1) cohort, (2) Ancestry group, (3) Beta, (4) Std. Error, (5) Seweighted, and (6) Pr(>|t|) which is the p-val
# Additionally, I will add the rows for calculating the meta in the Seweighted column and below all of the cohorts.
# getting the first cohort, note that this will eventually be in a loop of the cohorts

#os.chdir("C:\\Users\\jmarks\\Desktop\\Projects\\Nicotine\\GSCAN_extended_results_nicotine\\results\\results_from_missing_snp_lookup\\")
os.chdir("C:\\Users\\jmarks\\Desktop\\20180201")
mydata = pd.ExcelFile("rs910_example.xlsx")
mydata = mydata.parse("Sheet1")

cohorts_list44 = ["AAND_COGEND2_AA",
"COGEND_AA",
"COGEND_EA",
"COGEND2_AA",
"COGEND2_EA",
"COPDGene_AA",
"COPDGene_EA",
"deCODE_EA",
"Dental_Caries_EA",
"EAGLE_EA",
"FINN_TWIN_EA",
"GAIN_AA",
"GAIN_EA",
"JHS_AA",
"nonGAIN_EA",
"NTR_EA",
"SAGE_AA",
"SAGE_EA",
"UW_TTURC_AA",
"UW_TTURC_EA",
"YALE_PENN_AA",
"YALE_PENN_EA"]

totalRows = len(cohorts_list44)

# This dictionary maps each rsID (key) to a dataframe (value)
dataDict = {}

# initialize a dataframe
emptyArray = np.empty((totalRows,13,))
emptyArray[:] = np.nan
columns = ["SNP", "Cohort", "Ancestry group", "Beta", "Std. Error", "Seweighted", "Pr(>|t|)", 
           "AllMeta.SumSEweight", "AllMeta.weightedSE", "AllMeta.SEweighted_beta", 
           "AllMeta.SEweighted_Z", "AllMeta.SEweighted_Chi", "AllMeta.SEweighted_P"]
num_of_rsIDs = len(mydata)
print(num_of_rsIDs)

# Above this, write a script that removes the variants that were not present in any of the cohorts



# list of SNPs which were all NA across all cohorts
noDataSNPs = []

# loop to fill in information for the meta-analysis calculation
for rsID in range(num_of_rsIDs):
    
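    # record SNPs that are missing across all cohorts, then move on to the next SNP
    # (`continue` rather than `break`, so the remaining SNPs are still processed)
    if pd.isnull(mydata.iloc[rsID,3:]).all():
        noDataSNPs.append(mydata.iloc[rsID,0])
        continue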
    
    markerName = mydata.iloc[rsID,0]
    dataDict[markerName] = pd.DataFrame(columns=columns, data=emptyArray)
    dataDict[markerName].iloc[0,0] = markerName # add SNP
    metaSEweighted_beta = 0
    
    
    for cohort in range(len(cohorts_list44)):
        
        # get all of the cohort specific data
        cohortData = mydata.filter(like=cohorts_list44[cohort]).iloc[rsID,:]
        
        # add cohort to dataframe
        cohortName = cohorts_list44[cohort]
        dataDict[markerName].iloc[cohort,1] = cohorts_list44[cohort][0:-3]
            
        # add Ancestry group
        ancestry = cohorts_list44[cohort][-2:]
        dataDict[markerName].iloc[cohort, 2] = ancestry
        # add Beta
        betaVal = cohortData.filter(like=".beta").iloc[0]  # positional access on a label-indexed Series
        
        # flip the sign for cohorts whose alleles are oriented opposite to the majority (currently only FINN_TWIN_EA)
        if cohorts_list44[cohort] == "FINN_TWIN_EA":
            betaVal = -betaVal
        dataDict[markerName].iloc[cohort, 3] = betaVal
        
        # add Std. Error
        standardErr = cohortData.filter(like="sebeta").iloc[0]
        dataDict[markerName].iloc[cohort, 4] = standardErr
        
        # add Seweighted
        seWeighted = 1 / (standardErr ** 2)
        dataDict[markerName].iloc[cohort,5]  = seWeighted
        
        # add p-val
        pVal = cohortData.filter(regex=r"\.p$").iloc[0]  # escape the dot so only a literal ".p" suffix matches
        dataDict[markerName].iloc[cohort, 6] = pVal
        
        #  metaSEweighted_beta calculation 
        if not np.isnan(betaVal):
            metaSEweighted_beta += (betaVal*seWeighted)
      
    # Meta calculations
    SumSEweight = dataDict[markerName]['Seweighted'].sum()
    dataDict[markerName].iloc[0, 7] = SumSEweight

    metaWeightedSE = math.sqrt(1/SumSEweight)
    dataDict[markerName].iloc[0, 8] = metaWeightedSE
    
    metaSEweighted_beta = metaSEweighted_beta / SumSEweight 
    dataDict[markerName].iloc[0, 9] = metaSEweighted_beta

    metaSEweighted_Z = (metaSEweighted_beta / metaWeightedSE)
    dataDict[markerName].iloc[0, 10] = metaSEweighted_Z

    metaSEweighted_chi = metaSEweighted_Z ** 2
    dataDict[markerName].iloc[0, 11] = metaSEweighted_chi

    metaSEweighted_P = '%.2E' % Decimal(chi2.sf(metaSEweighted_chi, 1))
    dataDict[markerName].iloc[0, 12] = metaSEweighted_P


noDataSNPs
#dataDict[markerName]
#dataDict["rs1008078"]
#dataDict["rs1022528"]
#dataDict[markerName].to_csv("C:\\Users\\jmarks\\Desktop\\out.file", sep='\t', index=False)

3


['rs10698713']
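
For reference, the per-SNP calculation in the cell above is a fixed-effect, inverse-variance-weighted meta-analysis. A compact, standard-library-only sketch of the same arithmetic follows. The function name `fixed_effect_meta` and the example numbers are hypothetical. Two choices here are assumptions rather than the notebook's behavior: a cohort is dropped when either its beta or its standard error is NaN, and `math.erfc` stands in for `scipy.stats.chi2.sf`, which gives the same p-value for 1 degree of freedom.

```python
import math


def fixed_effect_meta(betas, std_errs):
    """Inverse-variance-weighted meta-analysis of one SNP (hypothetical helper)."""
    # Keep only cohorts where both the beta and the standard error are present
    pairs = [
        (b, se)
        for b, se in zip(betas, std_errs)
        if not (math.isnan(b) or math.isnan(se))
    ]
    if not pairs:
        raise ValueError("no cohort has both a beta and a standard error")

    weights = [1.0 / se ** 2 for _, se in pairs]      # Seweighted
    sum_weights = sum(weights)                        # AllMeta.SumSEweight
    meta_se = math.sqrt(1.0 / sum_weights)            # AllMeta.weightedSE
    meta_beta = sum(b * w for (b, _), w in zip(pairs, weights)) / sum_weights
    z = meta_beta / meta_se                           # AllMeta.SEweighted_Z
    chi_sq = z ** 2                                   # AllMeta.SEweighted_Chi
    # Upper tail of a chi-square with 1 df at z**2 equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))      # AllMeta.SEweighted_P
    return {
        "sum_weights": sum_weights,
        "meta_se": meta_se,
        "meta_beta": meta_beta,
        "z": z,
        "chi_sq": chi_sq,
        "p_value": p_value,
    }


# Made-up numbers; the third cohort has no beta and is skipped
result = fixed_effect_meta([0.1, 0.3, float("nan")], [0.1, 0.1, 0.2])
```

With two equally weighted cohorts, the pooled beta is simply their average (about 0.2 in this example), which makes the function easy to sanity-check by hand.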

In [125]:
type(mydata.iloc[2,3:])
pd.isnull(mydata.iloc[2,3:]).all()

True

In [119]:
for key in dataDict:
    print(key)
    print(dataDict[key])
    #dataDict[key]

rs910083
         SNP         Cohort Ancestry group      Beta  Std. Error   Seweighted  \
0   rs910083   AAND_COGEND2             AA -0.053740    0.029117  1179.515803   
1        NaN         COGEND             AA  0.010431    0.048347   427.812778   
2        NaN         COGEND             EA -0.027238    0.025962  1483.600659   
3        NaN        COGEND2             AA -0.053740    0.029117  1179.515803   
4        NaN        COGEND2             EA -0.063911    0.063533   247.746052   
5        NaN       COPDGene             AA -0.052426    0.026454  1428.918475   
6        NaN       COPDGene             EA -0.022262    0.022076  2051.858585   
7        NaN         deCODE             EA  0.020300    0.010500  9070.294785   
8        NaN  Dental_Caries             EA -0.087564    0.058976   287.503729   
9        NaN          EAGLE             EA -0.030085    0.018898  2800.007008   
10       NaN      FINN_TWIN             EA -0.012611    0.021360  2191.782743   
11       NaN       

## Need to determine what to do with NaN values in my data.
### Also, if data is NaN for all cohorts, need to remove them from my data (compile a list of these SNPs)
Ask Dana how these go into the calculation.
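
One possible way to compile the list of SNPs that are NaN in every cohort, and to drop them before the main loop, is a row-wise mask. This is a sketch on a made-up table: the SNP IDs, column names, and values are hypothetical, and it assumes the cohort data start in the second column (the real sheet is sliced from column index 3 in the loop above).

```python
import numpy as np
import pandas as pd

# Hypothetical look-up table: one row per SNP, cohort columns after the ID
lookup = pd.DataFrame({
    "SNP": ["rsA", "rsB", "rsC"],
    "COHORT_1.beta_SNP_add": [0.10, np.nan, -0.20],
    "COHORT_2.beta_SNP_add": [np.nan, np.nan, 0.05],
})

# Columns holding cohort data (assumption: everything after the first column)
data_cols = lookup.columns[1:]

# True for rows where every cohort value is missing
all_missing = lookup[data_cols].isna().all(axis=1)

# SNPs with no data in any cohort, and the table with those rows removed
no_data_snps = lookup.loc[all_missing, "SNP"].tolist()
usable = lookup.loc[~all_missing].reset_index(drop=True)
```

SNPs that are missing in only some cohorts (like `rsA` here) stay in `usable`; how those partial NaN values should enter the calculation is still the open question above.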

# 044 meta-analysis

In [51]:
import os
import pandas as pd



os.chdir("C:\\Users\\jmarks\\Desktop\\Projects\\Nicotine\\GSCAN_extended_results_nicotine\\results\\results_from_missing_snp_lookup\\")


#os.listdir()
#xl = pd.ExcelFile("missing_SNPs_results_prefiltered_meta_analyses_044_045_046_V02.xlsx")
xl.sheet_names
zero44 = xl.sheet_names[0]
zero44 = xl.parse(zero44)
zero45 = xl.sheet_names[1]
zero45 = xl.parse(zero45)
zero46 = xl.sheet_names[2]
zero46 = xl.parse(zero46)
zero44.iloc[0:1,:]

# beta is in each sheet, and so is sebeta.

IndexError: list index out of range

# 045 meta-analysis

# 046 meta-analysis