# GWAS Locus Browser Locus Zoom Scripts
- **Author** - Frank Grenn
- **Date Started** - June 2019
- **Quick Description:** code to generate json files for interactive locus zoom.
- **Data:** 
input files obtained from: [META5](https://www.ncbi.nlm.nih.gov/pubmed/31701892) and [PD Progression](https://movementdisorders.onlinelibrary.wiley.com/doi/full/10.1002/mds.27845)
[Static Locus Zoom](http://locuszoom.org/)
[Interactive Locus Zoom](https://github.com/statgen/locuszoom/wiki)


In [1]:
import pandas as pd
import numpy as np

In [2]:
#get summary stats
data = pd.read_csv("resultsForSmr_filtered.tab",sep="\t")


In [3]:
print(data.shape)
print(data.head())

(7802656, 8)
           SNP A1 A2    freq       b      se       p        N
0    rs7899632  A  G  0.5665  0.0110  0.0095  0.2476  1474097
1   rs61875309  A  C  0.7953 -0.0091  0.0116  0.4295  1474097
2  rs150203744  T  C  0.0140 -0.0152  0.0649  0.8147  1351069
3  rs111551711  T  C  0.9868  0.0347  0.0742  0.6396   777210
4   rs12258651  T  G  0.8819 -0.0011  0.0149  0.9423  1474097


In [4]:
#get the snp position data to merge
pos_data = pd.read_csv("$PATH1/HRC_RS_conversion_final_with_CHR.txt", sep = "\t")


In [5]:
print(pos_data.shape)
print(pos_data.head())

(33619058, 5)
      CHR:POS      POS           ID REF ALT
0  chr1:13380  1:13380  rs571093408   C   G
1  chr1:16071  1:16071  rs541172944   G   A
2  chr1:16141  1:16141  rs529651976   C   T
3  chr1:49298  1:49298  rs200943160   T   C
4  chr1:54353  1:54353  rs140052487   C   A


In [6]:
merge = pd.merge(data,pos_data,how='left',left_on='SNP',right_on='ID')

In [8]:
print(merge.shape)
print(merge.head())

(7818616, 13)
           SNP A1 A2    freq       b      se       p        N  \
0    rs7899632  A  G  0.5665  0.0110  0.0095  0.2476  1474097   
1   rs61875309  A  C  0.7953 -0.0091  0.0116  0.4295  1474097   
2  rs150203744  T  C  0.0140 -0.0152  0.0649  0.8147  1351069   
3  rs111551711  T  C  0.9868  0.0347  0.0742  0.6396   777210   
4   rs12258651  T  G  0.8819 -0.0011  0.0149  0.9423  1474097   

           CHR:POS           POS           ID REF ALT  
0  chr10:100000625  10:100000625    rs7899632   A   G  
1  chr10:100000645  10:100000645   rs61875309   A   C  
2  chr10:100001867  10:100001867  rs150203744   C   T  
3  chr10:100002464  10:100002464  rs111551711   T   C  
4  chr10:100003242  10:100003242   rs12258651   T   G  


In [9]:
split = merge['POS'].str.split(":",n=1,expand = True)
merge['chromosome']=split[0]
merge['position']=split[1]

In [10]:
merge['log_pvalue']=-1*np.log10(merge.p)


In [11]:
def formatVariant(chrm,bp,ref,alt):
    return str(chrm)+":"+str(bp)+"_"+str(ref)+"/"+str(alt);
    

In [None]:
merge['variant']=merge.apply(lambda x: formatVariant(x.chromosome, x.position, x.REF, x.ALT), axis = 1)

In [43]:
print(merge.shape)
print(merge.head())

7818616
           SNP A1 A2    freq       b      se       p        N  \
0    rs7899632  A  G  0.5665  0.0110  0.0095  0.2476  1474097   
1   rs61875309  A  C  0.7953 -0.0091  0.0116  0.4295  1474097   
2  rs150203744  T  C  0.0140 -0.0152  0.0649  0.8147  1351069   
3  rs111551711  T  C  0.9868  0.0347  0.0742  0.6396   777210   
4   rs12258651  T  G  0.8819 -0.0011  0.0149  0.9423  1474097   

           CHR:POS           POS           ID REF ALT chromosome   position  \
0  chr10:100000625  10:100000625    rs7899632   A   G         10  100000625   
1  chr10:100000645  10:100000645   rs61875309   A   C         10  100000645   
2  chr10:100001867  10:100001867  rs150203744   C   T         10  100001867   
3  chr10:100002464  10:100002464  rs111551711   T   C         10  100002464   
4  chr10:100003242  10:100003242   rs12258651   T   G         10  100003242   

   log_pvalue  analysis           variant  
0    0.606249        45  10:100000625_A/G  
1    0.367037        45  10:100000645_

In [42]:
merge.to_csv("$PATH1/locuszoom/tempdata.csv", index=False)

In [27]:
alldata = pd.read_csv("$PATH1/locuszoom/tempdata.csv")

In [28]:
alldata = alldata.rename(columns={"b":"beta","REF":"ref_allele","freq":"ref_allele_freq"})

print(alldata.head())

           SNP A1 A2  ref_allele_freq    beta      se       p        N  \
0    rs7899632  A  G           0.5665  0.0110  0.0095  0.2476  1474097   
1   rs61875309  A  C           0.7953 -0.0091  0.0116  0.4295  1474097   
2  rs150203744  T  C           0.0140 -0.0152  0.0649  0.8147  1351069   
3  rs111551711  T  C           0.9868  0.0347  0.0742  0.6396   777210   
4   rs12258651  T  G           0.8819 -0.0011  0.0149  0.9423  1474097   

           CHR:POS           POS           ID ref_allele ALT  chromosome  \
0  chr10:100000625  10:100000625    rs7899632          A   G        10.0   
1  chr10:100000645  10:100000645   rs61875309          A   C        10.0   
2  chr10:100001867  10:100001867  rs150203744          C   T        10.0   
3  chr10:100002464  10:100002464  rs111551711          T   C        10.0   
4  chr10:100003242  10:100003242   rs12258651          T   G        10.0   

      position           variant  log_pvalue  analysis  
0  100000625.0  10:100000625_A/G    0.606

In [67]:
loci = pd.read_csv("$PATH1/GWAS_loci_overview.csv")
print(loci.head())

   Locus Number DONE?  Date when done? Volunteer1 Volunteer2          SNP  \
0             1   NaN              NaN  Corrnelis      Lynne  rs114138760   
1             1   NaN              NaN  Corrnelis      Lynne   rs35749011   
2             1   NaN              NaN  Corrnelis      Lynne   rs76763715   
3             2   NaN              NaN        NaN        NaN    rs6658353   
4             3   NaN              NaN    Jillian   Emmeline   rs11578699   

  CHR:BP (hg19)     full region (hg19)  Number of genes  CHR  ...  \
0   1:154898185  1:153898185-155898185               92    1  ...   
1   1:155135036  1:154135036-156135036               92    1  ...   
2   1:155205634  1:154205634-156205634               92    1  ...   
3   1:161469054  1:160469054-162469054               55    1  ...   
4   1:171719769  1:170719769-172719769               24    1  ...   

  Effect allele Other allele Effect allele frequency Beta, all studies  \
0             c            g                  0.

In [68]:
split = loci['CHR:BP (hg19)'].str.split(":",n=1,expand = True)
loci['position']=split[1]

In [69]:

loci['position'] = loci['position'].astype('int32')
print(loci.dtypes)
print(loci.head())

Locus Number                                int64
DONE?                                      object
Date when done?                           float64
Volunteer1                                 object
Volunteer2                                 object
SNP                                        object
CHR:BP (hg19)                              object
full region (hg19)                         object
Number of genes                             int64
CHR                                         int64
BP                                         object
Nearest Gene                               object
META5 QTL Nominated Gene (nearest QTL)     object
Effect allele                              object
Other allele                               object
Effect allele frequency                   float64
Beta, all studies                         float64
SE, all studies                           float64
P, all studies                            float64
P, COJO, all studies                      float64


In [99]:
#loop
for i in range(len(loci.index)):
    chrm = str(loci.iloc[i]['CHR']);
    pos = loci.iloc[i]['position'];
    start = pos - 1000000;
    end = pos + 1000000;
    
    
    #subset by chromosome
    chrdata = merge[(merge['chromosome'] == chrm)]

    chrdata['position'] = chrdata['position'].astype('int32')


    #and then by position
    rangeddata = chrdata[(chrdata['position'] >= start) & (chrdata['position'] <= end)]
    #add quotes around certain fields to make locus zoom happy
    ref_allele_str = '"'+rangeddata['REF']+'"'
    chr_str = '"'+rangeddata['chromosome']+'"'
    variant_str = '"'+rangeddata['variant']+'"'

    rangeddata['REF']=ref_allele_str
    rangeddata['chromosome']=chr_str
    rangeddata['variant']=variant_str


    chromosome=','.join(map(str,rangeddata['chromosome'].tolist()))
    log_pvalue=','.join(map(str,rangeddata['log_pvalue'].tolist()))
    position=','.join(map(str,rangeddata['position'].tolist()))
    ref_allele=','.join(map(str,rangeddata['REF'].tolist()))
    variant=','.join(map(str,rangeddata['variant'].tolist()))

    jsonstring = '{{\
	    "data": {{\
	        "chromosome": [\
                {}\
	        ],\
    	    "log_pvalue": [\
        	    {}\
        	],\
        	"position": [\
	            {}\
	        ],\
	        "ref_allele": [\
	            {}\
	        ],\
	        "variant": [\
	            {}\
	        ]\
	    }},\
	    "lastPage": null\
	}}'.format(chromosome,log_pvalue,position,ref_allele,variant)

    print(loci.iloc[i]['SNP'] + " " + str(loci.iloc[i]['CHR']) + ":" + str(loci.iloc[i]['position']))
    json_file = open("$PATH1/locuszoom/interactive_stats/"+loci.iloc[i]['SNP']+"_locus.json", "w")
    json_file.write(jsonstring)
    json_file.close()
	
	
	

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pan

                 SNP A1 A2    freq       b      se         p        N  \
2868767   rs60382526  T  C  0.0457 -0.0132  0.0232  0.568300  1474097   
2868770  rs141860636  A  G  0.9891 -0.0992  0.0789  0.209000  1349460   
2868771   rs77490383  T  G  0.0143  0.2230  0.0635  0.000444   907933   
2868772   rs10796968  A  C  0.6940  0.0271  0.0101  0.007473  1474097   
2868774   rs76189736  C  G  0.9543  0.0133  0.0232  0.567100  1474097   

                CHR:POS          POS           ID REF ALT chromosome  \
2868767  chr1:153898554  1:153898554   rs60382526   C   T          1   
2868770  chr1:153898697  1:153898697  rs141860636   A   G          1   
2868771  chr1:153899248  1:153899248   rs77490383   G   T          1   
2868772  chr1:153899330  1:153899330   rs10796968   A   C          1   
2868774  chr1:153900174  1:153900174   rs76189736   C   G          1   

          position  log_pvalue  analysis          variant  
2868767  153898554    0.245422        45  1:153898554_C/T  
2868770 

## Progression Stats

In [96]:
#data = pd.read_csv("$PATH1/locuszoom/base_INS.txt",sep="\t")
#locus_snp='rs61863020'

data = pd.read_csv("$PATH1/locuszoom/surv_HY3.txt",sep="\t")
locus_snp='rs382940'

In [105]:
print(len(data.index))
print(data.head())

8623151
            SNP    BETA      SE        P     N  NSTUDY   Isq
0    5:29439275 -0.0322  0.0657  0.62410  2582       9  14.3
1    5:85928892  0.2634  0.1526  0.08427  1299       5   0.0
2   2:170966953  0.4025  0.2870  0.16080  2265       8   0.0
3  10:128341232 -0.1408  0.0783  0.07199  1299       5  46.5
4    3:62707519 -0.1344  0.1723  0.43550  1299       5   0.0


In [97]:
pos_data = pd.read_csv("$PATH1/locuszoom/reference.txt")

In [98]:
print(len(pos_data.index))
print(pos_data.head())

9240625
       SNP         RSID  CHR    START REF ALT     MAF            FUNC NearGENE
0  1:14470          NaN    1  14470.0   G   A  0.0263    ncRNA_exonic   WASH7P
1  1:14671  rs201055865    1  14671.0   G   C  0.0156    ncRNA_exonic   WASH7P
2  1:14773  rs878915777    1  14773.0   C   T  0.0178    ncRNA_exonic   WASH7P
3  1:16841   rs62636368    1  16841.0   G   T  0.0725  ncRNA_intronic   WASH7P
4  1:16856    rs3891260    1  16856.0   A   G  0.0199  ncRNA_splicing   WASH7P


In [99]:
merge = pd.merge(data,pos_data,how='left',left_on='SNP',right_on='SNP')

In [100]:
merge['START'] = merge['START'].astype('int32')
merge['CHR'] = merge['CHR'].astype('str')

In [101]:
print(len(merge.index))
print(merge.head())

8623151
            SNP    BETA      SE        P     N  NSTUDY   Isq         RSID CHR  \
0    5:29439275 -0.0322  0.0657  0.62410  2582       9  14.3          NaN   5   
1    5:85928892  0.2634  0.1526  0.08427  1299       5   0.0  rs113534962   5   
2   2:170966953  0.4025  0.2870  0.16080  2265       8   0.0  rs559397866   2   
3  10:128341232 -0.1408  0.0783  0.07199  1299       5  46.5          NaN  10   
4    3:62707519 -0.1344  0.1723  0.43550  1299       5   0.0          NaN   3   

       START REF ALT     MAF        FUNC                NearGENE  
0   29439275   C   T  0.3851  intergenic  LINC02064;LOC105374704  
1   85928892   C   T  0.0591  intergenic         COX7C;LINC02059  
2  170966953   T   C  0.0165  intergenic              UBR3;MYO3B  
3  128341232   C   T  0.4450    intronic                C10orf90  
4   62707519   C   T  0.0619    intronic                   CADPS  


In [102]:
merge['log_pvalue']=-1*np.log10(merge.P)

In [103]:
merge['variant']=merge.apply(lambda x: formatVariant(x.CHR, x.START, x.REF, x.ALT), axis = 1)

In [104]:
print(len(merge.index))
print(merge.head())

8623151
            SNP    BETA      SE        P     N  NSTUDY   Isq         RSID CHR  \
0    5:29439275 -0.0322  0.0657  0.62410  2582       9  14.3          NaN   5   
1    5:85928892  0.2634  0.1526  0.08427  1299       5   0.0  rs113534962   5   
2   2:170966953  0.4025  0.2870  0.16080  2265       8   0.0  rs559397866   2   
3  10:128341232 -0.1408  0.0783  0.07199  1299       5  46.5          NaN  10   
4    3:62707519 -0.1344  0.1723  0.43550  1299       5   0.0          NaN   3   

       START REF ALT     MAF        FUNC                NearGENE  log_pvalue  \
0   29439275   C   T  0.3851  intergenic  LINC02064;LOC105374704    0.204746   
1   85928892   C   T  0.0591  intergenic         COX7C;LINC02059    1.074327   
2  170966953   T   C  0.0165  intergenic              UBR3;MYO3B    0.793714   
3  128341232   C   T  0.4450    intronic                C10orf90    1.142728   
4   62707519   C   T  0.0619    intronic                   CADPS    0.361012   

            variant  
0 

In [106]:
locus_snp_data = merge[(merge['RSID']==locus_snp)]
locus_snp_index = locus_snp_data.index.tolist()[0]
print(locus_snp_index)
print(locus_snp_data)


2133204
                 SNP   BETA      SE             P     N  NSTUDY   Isq  \
2133204  9:108058562  0.711  0.1289  3.460000e-08  1890       7  44.8   

             RSID CHR      START REF ALT     MAF      FUNC NearGENE  \
2133204  rs382940   9  108058562   T   A  0.0864  intronic  SLC44A1   

         log_pvalue          variant  
2133204    7.460924  9:108058562_T/A  


In [107]:
locus_snp_data=merge.iloc[locus_snp_index]
print(locus_snp_data)
chrm = str(locus_snp_data['CHR']);
pos = locus_snp_data['START'];
start = pos - 1000000;
end = pos + 1000000;
print(chrm);
print('\n');


#subset by chromosome
chrdata = merge[(merge['CHR']==chrm)]

print(chrdata.dtypes)
print(chrdata.head())


#chrdata['START'] = chrdata['START'].astype('int32')


#and then by position
rangeddata = chrdata[(chrdata['START'] >= start) & (chrdata['START'] <= end)]
print('rangeddata')
#add quotes around certain fields to make locus zoom happy
ref_allele_str = '"'+rangeddata['REF']+'"'
chr_str = '"'+rangeddata['CHR']+'"'
variant_str = '"'+rangeddata['variant']+'"'

rangeddata['REF']=ref_allele_str
rangeddata['CHR']=chr_str
rangeddata['variant']=variant_str


chromosome=','.join(map(str,rangeddata['CHR'].tolist()))
log_pvalue=','.join(map(str,rangeddata['log_pvalue'].tolist()))
position=','.join(map(str,rangeddata['START'].tolist()))
ref_allele=','.join(map(str,rangeddata['REF'].tolist()))
variant=','.join(map(str,rangeddata['variant'].tolist()))

jsonstring = '{{\
    "data": {{\
        "chromosome": [\
            {}\
	    ],\
    	"log_pvalue": [\
        	{}\
        ],\
        "position": [\
	        {}\
	    ],\
	    "ref_allele": [\
	        {}\
	    ],\
	    "variant": [\
	        {}\
	    ]\
	}},\
	"lastPage": null\
}}'.format(chromosome,log_pvalue,position,ref_allele,variant)

print(locus_snp_data['RSID'] + " " + str(locus_snp_data['CHR']) + ":" + str(locus_snp_data['START']))
json_file = open("$PATH1/locuszoom/interactive_stats/"+locus_snp_data['RSID']+"_locus.json", "w")
json_file.write(jsonstring)
json_file.close()
	
	
	

SNP               9:108058562
BETA                    0.711
SE                     0.1289
P                    3.46e-08
N                        1890
NSTUDY                      7
Isq                      44.8
RSID                 rs382940
CHR                         9
START               108058562
REF                         T
ALT                         A
MAF                    0.0864
FUNC                 intronic
NearGENE              SLC44A1
log_pvalue            7.46092
variant       9:108058562_T/A
Name: 2133204, dtype: object
9


SNP            object
BETA          float64
SE            float64
P             float64
N               int64
NSTUDY          int64
Isq           float64
RSID           object
CHR            object
START           int32
REF            object
ALT            object
MAF           float64
FUNC           object
NearGENE       object
log_pvalue    float64
variant        object
dtype: object
             SNP    BETA      SE        P     N  NSTUDY   Isq        

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
