# FUMA tool
**Author:** Jesse Marks

The UK Biobank GWAS results are on the shared drive in the folder `//rcdcollaboration01.rti.ns/gxg/UKBiobank_GWASresults`. Include only the following meta-analysis output columns.

 
```
Chr
Position
MarkerName
P-value
Allele1
Allele2
Effect
StdErr
```

File of interest: `Heavy_vs_never_smokers.csv`

`head Heavy_vs_never_smokers.csv`
```
rsid,chromosome,position,nonref_allele,ref_allele,minor_allele,info,MAF_%,MAC,beta,se,P,se_gc,P_gc,P_Firth
rs185832753,1,51954,G,C,C,0.556,0.112,109,-1.01,0.257,8.78e-05,0.269,0.000186,
rs199502715,1,53234,CAT,C,C,0.661,0.115,112,0.148,0.231,0.521,0.242,0.541,
rs140052487,1,54353,C,A,A,0.548,0.199,195,0.234,0.194,0.228,0.204,0.251,
rs190850374,1,55367,G,A,A,0.619,0.0287,28.1,-0.0244,0.467,0.958,0.490,0.960,
```

So we want columns 2, 3, 1, 12, 5, 6, 10, and 11 (chromosome, position, rsid, P, ref_allele, minor_allele, beta, se), in that order.
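One way to double-check these column numbers is to print the header with an index next to each field. The sketch below is only illustrative: `demo.csv` and the temporary directory are assumptions, and the file holds nothing but the header shown above.

```shell
# Hypothetical demo file containing only the header shown above.
workdir=$(mktemp -d)
printf '%s\n' 'rsid,chromosome,position,nonref_allele,ref_allele,minor_allele,info,MAF_%,MAC,beta,se,P,se_gc,P_gc,P_Firth' > "$workdir/demo.csv"

# Print "index name" for every comma-separated header field.
numbered=$(head -n 1 "$workdir/demo.csv" | awk -F ',' '{ for (i = 1; i <= NF; i++) print i, $i }')
echo "$numbered"

# Look up the indices of the eight wanted columns, in output order.
wanted=$(head -n 1 "$workdir/demo.csv" | awk -F ',' '
    BEGIN { n = split("chromosome position rsid P ref_allele minor_allele beta se", want, " ") }
    { for (i = 1; i <= NF; i++) idx[$i] = i
      for (j = 1; j <= n; j++) printf "%s%s", idx[want[j]], (j < n ? "," : "\n") }')
echo "$wanted"
```

Looking the indices up by name guards against miscounting when the header is long.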

In [None]:
# print header
awk -F ',' ' NR==1{print $2,$3,$1,$12,$5,$6,$10,$11;exit}' Heavy_vs_never_smokers.csv > \
    Heavy_vs_never_smokers.ALL_CHR.FUMA

# rename ref_allele and minor_allele to A1 and A2 respectively (FUMA convention)
awk 'NR==1 { $5="A1"; $6="A2";print $0;exit}' Heavy_vs_never_smokers.ALL_CHR.FUMA >\
    tmp && mv tmp Heavy_vs_never_smokers.ALL_CHR.FUMA
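# append the data rows; stop at the first row whose chromosome value is greater than 23
awk -F ',' ' NR>=2 { {if ( $2 > 23) { exit }}  {print $2,$3,$1,$12,$5,$6,$10,$11}}' \
     Heavy_vs_never_smokers.csv >> Heavy_vs_never_smokers.ALL_CHR.FUMA

# run in the foreground so that gzip only starts after the append has finished
gzip Heavy_vs_never_smokers.ALL_CHR.FUMA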

In [None]:
# header after renaming: chromosome position rsid P A1 A2 beta se


awk ' { if ( $1==23) {exit }} {print $0} ' <(zcat Heavy_vs_never_smokers.ALL_CHR.FUMA.gz) | head

In [None]:
## EC2 command line
for chr in {2..22};do
    zcat Heavy_vs_never_smokers.ALL_CHR.FUMA.gz | head -n1 > heavy.never.test/chr$chr.test.a

    awk -v chr="$chr" ' { if ( $1 == chr ) { print $0 } if ( $1 == chr+1 ) { exit } }' \
        <(zcat Heavy_vs_never_smokers.ALL_CHR.FUMA.gz) >> heavy.never.test/chr$chr.test.a &
done

# wait for the background awk jobs to finish before compressing their output
wait
gzip heavy.never.test/*.test.a


# Chr15 and Chr7 throwing errors

## Chr15
### Remove duplicate SNPs
According to the [FUMA user guide forum](https://groups.google.com/forum/#!categories/fuma-gwas-users), ERROR 6 may be caused by duplicate SNPs. I tested this by removing duplicate SNPs, both duplicate positions and duplicate rsIDs.

After removing duplicates by position and by rsID, I still received ERROR 006, so the cause must be something else.

In [None]:
# number of unique positions (header excluded)
tail -n +2 chr15.test.a | awk '{ if (!seen[$2]++) {print $0}}' | wc -l
# 706154

# number of unique rsIDs (header excluded)
tail -n +2 chr15.test.a | awk '{ if (!seen[$3]++) {print $0}}' | wc -l
# 706152

# drop duplicate positions, then duplicate rsIDs (header kept)
awk '{ if (!seen[$2]++) {print $0}}' chr15.test.a > chr15.test.c02
awk '{ if (!seen[$3]++) {print $0}}' chr15.test.c02 > chr15.test.c03
wc -l chr15.test.c03
# 706120
# job name: chr15.c03 removed all dups (rsID and position)
# This job failed; ERROR 006
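
To see which entries are duplicated, rather than only counting the unique ones, something like the following could be used. This is a sketch on a made-up four-SNP file (`demo.test.a` and its rows are hypothetical); it assumes the same layout as the files above, with position in column 2 and rsID in column 3.

```shell
workdir=$(mktemp -d)
# Hypothetical demo data: position 200 and rsID rs2 each appear twice.
cat > "$workdir/demo.test.a" <<'EOF'
chromosome position rsid P A1 A2 beta se
15 100 rs1 0.5 A G 0.1 0.2
15 200 rs2 0.4 C T 0.1 0.2
15 200 rs3 0.3 C G 0.1 0.2
15 300 rs2 0.2 A T 0.1 0.2
EOF

# Print each position (column 2) the second time it is seen, skipping the header.
dup_pos=$(awk 'NR > 1 && ++count[$2] == 2 { print $2 }' "$workdir/demo.test.a")
# Same idea for rsIDs (column 3).
dup_rsid=$(awk 'NR > 1 && ++count[$3] == 2 { print $3 }' "$workdir/demo.test.a")
echo "duplicated positions: $dup_pos"
echo "duplicated rsIDs: $dup_rsid"
```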


### The following SNP is throwing ERROR 6
I found this SNP by bisection: I submitted jobs with different subsets of SNPs until I had narrowed the failure down to this one SNP.
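
Each bisection step amounts to cutting a range of lines out of the file, putting the header back on top, and submitting the result to FUMA. A minimal sketch of that cutting step follows. The helper name `subset_lines` and the demo file are hypothetical, and the FUMA submission itself stays manual.

```shell
# subset_lines FILE START END OUT  (hypothetical helper)
# Write the header (line 1) plus lines START..END (1-based, inclusive) of FILE to OUT.
subset_lines() {
    awk -v s="$2" -v e="$3" 'NR == 1 || (NR >= s && NR <= e)' "$1" > "$4"
}

workdir=$(mktemp -d)
# Hypothetical demo file: header plus 10 made-up data rows (file lines 2..11).
{
    echo "chromosome position rsid P A1 A2 beta se"
    i=1
    while [ "$i" -le 10 ]; do
        echo "15 $((i * 100)) rs$i 0.5 A G 0.1 0.2"
        i=$((i + 1))
    done
} > "$workdir/demo.test.a"

# Split the data lines into a lower and an upper half around the midpoint.
start=2
end=11
mid=$(( (start + end) / 2 ))
subset_lines "$workdir/demo.test.a" "$start" "$mid" "$workdir/demo.lower"
subset_lines "$workdir/demo.test.a" "$((mid + 1))" "$end" "$workdir/demo.upper"
wc -l "$workdir/demo.lower" "$workdir/demo.upper"
```

Whichever half fails becomes the new `start`..`end` range. Using explicit inclusive line numbers makes it easier to avoid overlapping or skipped lines between the two halves.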

In [None]:
more chr15.test.e463421
chromosome position rsid P A1 A2 beta se
15 78915370 rs34573245;rs202041589 1.14e-14 CT C -0.101 0.0131

The SNP at 15:78719501 has rs72736802;rs200422183 in its rsID column.

1. Remove the SNP from the full set of results.
    - Once I have identified another problematic SNP, I can add SNP 78915370 back in to check whether it is still problematic.

In [None]:
 grep -n 78915370 chr15.test.a
475138:15 78915370 rs34573245;rs202041589 1.14e-14 CT C -0.101 0.0131

sed '475138d' chr15.test.a > chr15.test.b01
# this test case failed with FUMA
# next subset these data to the first 500,000 lines

head -n500000 chr15.test.b01 > chr15.test.b02
# This test case failed with FUMA
# next subset these data to the first 450,000 lines

head -n450000 chr15.test.b01 > chr15.test.b03 # chr15.b03(removed 78915370 and subset to 450K)
# This test case passed
# next subset these data to the first 475K lines

head -n475000 chr15.test.b01 > chr15.test.b04 # chr15.b04(removed 78915370 and subset to 475K)
# This test case failed, ERROR 6
# next test the last 25K (from 475K-500K) to determine whether any problematic SNPs are there.

head -n1 chr15.test.b01 > chr15.test.b05
tail -n +475001 chr15.test.b01 | head -n 25000 >> chr15.test.b05 # chr15.b05(removed 78915370 and subset to 475K-500K)#
# This test case passed without any errors
# next test the first 470K lines

head -n470000 chr15.test.b01 > chr15.test.b06 # chr15.b06(removed 78915370 and subset to 470K)
# This test case passed (returned ERROR 5 rather than ERROR 6)
# next test lines 470K-474K

head -n1 chr15.test.b01 > chr15.test.b07
tail -n +470000 chr15.test.b01 | head -n 4000 >> chr15.test.b07 # chr15.b07(removed 78915370 and subset to 470K-474K)
# This test case failed; ERROR 6
# next test lines 475K-end

head -n1 chr15.test.b01 > chr15.test.b08
tail -n +475001 chr15.test.b01 >> chr15.test.b08 # chr15.b08(removed 78915370 and subset to 475K-end)#
# This test case passed!
# next test lines 470K-472K

head -n1 chr15.test.b01 > chr15.test.b09
tail -n +470000 chr15.test.b01 | head -n 2000 >> chr15.test.b09  # chr15.b09(removed 78915370 and subset to 470K-472K)#
# This test case passed (returned ERROR 5 rather than ERROR 6)
# next test 472K-473K

head -n1 chr15.test.b01 > chr15.test.b10
tail -n +472000 chr15.test.b01 | head -n 1000 >> chr15.test.b10  # chr15.b10(removed 78915370 and subset to 472K-473K)#
# This test case passed
# Next test 473K-474K

head -n1 chr15.test.b01 > chr15.test.b11
tail -n +473000 chr15.test.b01 | head -n 1000 >> chr15.test.b11  # chr15.b11(removed 78915370 and subset to 473K-474K)#
# This test case failed; ERROR 6
# Next test 473-473.5K

head -n1 chr15.test.b01 > chr15.test.b12
tail -n +473000 chr15.test.b01 | head -n 500 >> chr15.test.b12  # chr15.b12(removed 78915370 and subset to 473K-473.5K)#
# This test case failed; ERROR 6
# Next test 473.5K-474K

head -n1 chr15.test.b01 > chr15.test.b13
tail -n +473500 chr15.test.b01 | head -n 500 >> chr15.test.b13  # chr15.b13(removed 78915370 and subset to 473.5K-474K)#
# This test case failed; ERROR 6
# Next test 473-473.25K (note I need to test 473.5-473.75K && 473.75-474K)

head -n1 chr15.test.b01 > chr15.test.b14
tail -n +473000 chr15.test.b01 | head -n 250 >> chr15.test.b14  # chr15.b14(removed 78915370 and subset to 473K-473.25K)#
# This test case passed 
# Next test 473.25-473.5 (should fail!)

head -n1 chr15.test.b01 > chr15.test.b15
tail -n +473250 chr15.test.b01 | head -n 250 >> chr15.test.b15  # chr15.b15(removed 78915370 and subset to 473.25-473.5K)#
# This test case failed; ERROR 6
# Next test 473.75-474K

head -n1 chr15.test.b01 > chr15.test.b16
tail -n +473750 chr15.test.b01 | head -n 250 >> chr15.test.b16  # chr15.b16(removed 78915370 and subset to 473.75K-474K)#
# This test case failed; ERROR 6
# Next test 473.4-473.5


head -n1 chr15.test.b01 > chr15.test.b17
tail -n +473400 chr15.test.b01 | head -n 100 >> chr15.test.b17  # chr15.b17(removed 78915370 and subset to 473.4K-473.5K)#
# This test case passed

Note that the problem could be marker IDs that contain a semicolon and/or marker names that start with `Affx-`.
* It turns out that the culprit is marker IDs that contain multiple rsIDs separated by a semicolon. The solution is to remove the trailing rsIDs.
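
Once the pattern is known, the affected rows can be listed directly instead of being found by bisection. The sketch below assumes the rsID is in column 3. The demo file is hypothetical: it reuses the chr15 row shown earlier and adds one made-up single-rsID row.

```shell
workdir=$(mktemp -d)
# Hypothetical demo file: one made-up clean row plus the multi-rsID row from above.
cat > "$workdir/demo.test.a" <<'EOF'
chromosome position rsid P A1 A2 beta se
15 100 rs1 0.5 A G 0.1 0.2
15 78915370 rs34573245;rs202041589 1.14e-14 CT C -0.101 0.0131
EOF

# Print the line number and row for every SNP whose rsID column contains a semicolon.
multi=$(awk '$3 ~ /;/ { print NR ": " $0 }' "$workdir/demo.test.a")
echo "$multi"

# Count the affected rows.
n_multi=$(awk '$3 ~ /;/' "$workdir/demo.test.a" | wc -l | tr -d ' ')
echo "rows with multiple rsIDs: $n_multi"
```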

In [None]:
sed '473000,475000d' chr15.test.a > chr15.test.d01 # chr15.d01 (removed 473K-475K)
# failed
wc -l chr15.test.a
wc -l chr15.test.d01

sed '470000,476000d' chr15.test.a > chr15.test.d02 # chr15.d02 (removed 470K-476K)

## Problem and Solution

The problem was caused by SNPs that have multiple rsIDs separated by ";".
For example, the SNP at 15:78719501 has rs72736802;rs200422183 in the rsID column.
Selecting one of the rsIDs, or replacing the entry with a unique ID such as 15:78719501:A:T, avoids this error.
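
The second option, replacing the entry with a unique ID, is not the one used in this notebook, but it could be sketched as follows. The ID is assembled as `chromosome:position:A1:A2` from columns 1, 2, 5, and 6; that allele order is an assumption, and the demo file is hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical demo file: one made-up clean row plus the multi-rsID row from chr15.
cat > "$workdir/demo.test.a" <<'EOF'
chromosome position rsid P A1 A2 beta se
15 100 rs1 0.5 A G 0.1 0.2
15 78915370 rs34573245;rs202041589 1.14e-14 CT C -0.101 0.0131
EOF

# Where the rsID (column 3) contains ";", substitute chr:pos:A1:A2; print every row.
awk '$3 ~ /;/ { $3 = $1 ":" $2 ":" $5 ":" $6 } { print }' \
    "$workdir/demo.test.a" > "$workdir/demo.unique_ids"
cat "$workdir/demo.unique_ids"
```

Rows with a single rsID pass through unchanged, so the ID column ends up mixing rsIDs and positional IDs.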

In [None]:
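# earlier scratch attempts at the gsub syntax (incomplete: no input file given)
# awk ' {gsub("", "")}'
# awk '{ gsub("/", "_") ; system( "echo "  $0) }'
# awk '{gsub(/;$/, "", $1); print $1}' > chr15.test.e463421

# working command: in column 3 (rsid), delete everything from the first ";" onward
awk '{gsub ( /;.+$/, "", $3); print $0}' chr15.test.a > chr15.test.z001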

## Chr7

In [None]:
awk '{gsub ( /;.+$/, "", $3); print $0}'  chr7.test.a > chr7.test.z001

## Genome-wide

In [None]:
awk '{gsub ( /;.+$/, "", $3); print $0}' <(zcat Heavy_vs_never_smokers.ALL_CHR.FUMA.gz) >\
    Heavy_vs_never_smokers.ALL_CHR.cleaned_rsIDs.FUMA
    
gzip Heavy_vs_never_smokers.ALL_CHR.cleaned_rsIDs.FUMA