# UKBioBank chrX SNP lookup
**Author**: Jesse Marks

The FUMA file of UKBioBank chrX SNPs that I created for Dana Hancock on December 6, 2018 was missing rsID information. This notebook documents the effort to find the rsIDs for those SNPs from their `chr:position:A1:A2` information.

The strategy is to build a Python dictionary from the 1000G reference panel chrX data on the shared drive, keyed by `position:A1:A2`, and then look up each SNP in that dictionary to find its rsID. For example:

`{"2699555:C:A": "rs311165"}`

For each SNP in my FUMA file, I will build a `position:A1:A2` key and look it up in the dictionary. If the key is present, I will write the rsID along with the other columns from my FUMA file. SNPs without a match are left out of the output.
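The lookup described above can be sketched on a few in-memory lines before running it on the full files. This is only an illustration: the legend column layout (`id position a0 a1`), the second legend entry, the second FUMA row, and the helper names `build_map` and `annotate` are assumptions made for the example, not taken from the real files.

```python
# Minimal sketch of the position:A1:A2 -> rsID lookup on in-memory data.
# ASSUMPTION: the legend has columns id, position, a0, a1; check the real header.
legend_lines = [
    "id position a0 a1",
    "rs311165:2699555:C:A 2699555 C A",
    "X:2699600:G:T 2699600 G T",  # hypothetical entry without an rsID
]
fuma_lines = [
    "23\t2699555\t5.76807e-02\tC\tA\t-1.15436e-04\t6.08161e-05",
    "23\t2699600\t1.0e-01\tG\tT\t1.0e-04\t1.0e-05",  # hypothetical row
]


def build_map(lines):
    """Map "position:A1:A2" to rsID, skipping the header and non-rs IDs."""
    rsid_map = {}
    for line in lines[1:]:
        snp_id, pos, a1, a2 = line.split()[:4]
        name = snp_id.split(":")[0]  # part of the ID before the first ":"
        if name.startswith("rs"):
            rsid_map["{}:{}:{}".format(pos, a1, a2)] = name
    return rsid_map


def annotate(rows, rsid_map):
    """Prepend the rsID to every row whose key is in the map; drop the rest."""
    out = []
    for row in rows:
        fields = row.split()
        key = "{}:{}:{}".format(fields[1], fields[3], fields[4])
        if key in rsid_map:
            out.append("{}\t{}".format(rsid_map[key], row))
    return out


rsid_map = build_map(legend_lines)
annotated = annotate(fuma_lines, rsid_map)
print(annotated)
```

Only the first FUMA row comes back annotated here, because the second legend entry has no rsID and is never added to the dictionary.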

The following file maps position to rsID:
`/shared/common/build_conversion/b37/dbsnp_b138/uniquely_mapped_snps.positions`

In [None]:
# inspect the chrX FUMA file, then copy it to EC2
head 20181206_20002_1439.gwas.imputed_v3.both_sexes_CHRX_maf_low-conf_filter.FUMA
scp -i ~/.ssh/gwas_rsa 20181206_20002_1439.gwas.imputed_v3.both_sexes_CHRX_maf_low-conf_filter.FUMA ec2-user@54.90.227.178:/shared/jmarks/nicotine/ukbiobank/

```
head 20181206_20002_1439.gwas.imputed_v3.both_sexes_CHRX_maf_low-conf_filter.FUMA
Chr     Position        P-value Allele1 Allele2 Effect  StdErr
23      2699555 5.76807e-02     C       A       -1.15436e-04    6.08161e-05
23      2699645 1.43253e-01     G       T       -1.15245e-04    7.87306e-05
23      2699676 5.70700e-01     G       A       -4.25255e-05    7.49979e-05
23      2699898 6.36532e-01     C       CT      -3.62646e-05    7.67420e-05
23      2699968 9.87747e-02     A       G       9.10008e-05     5.51244e-05
```

In [None]:
### Python3 ###
import gzip

data_dir = "/shared/jmarks/nicotine/ukbiobank"
in_file = "{}/20181206_20002_1439.gwas.imputed_v3.both_sexes_CHRX_maf_low-conf_filter.FUMA".format(data_dir)
out_file = "{}/20181210_20002_1439.gwas.imputed_v3.both_sexes_CHRX_maf_low-conf_filter_rsID_added.FUMA".format(data_dir)
map_file = "/shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz"

# Open the legend in text mode ("rt"). gzip.open defaults to binary mode, and in
# Python 3 bytes never compare equal to "rs", which would leave the dictionary empty.
with open(in_file) as inF, open(out_file, "w") as outF, gzip.open(map_file, "rt") as mapF:
    mapF.readline()  # skip the legend header
    map_dic = {}

    # Build the lookup: "position:A1:A2" -> rsID
    for map_line in mapF:
        fields = map_line.split()
        mapid = fields[0].split(":")[0]  # part of the ID before the first ":"
        if mapid.startswith("rs"):
            key = "{}:{}:{}".format(fields[1], fields[2], fields[3])
            map_dic[key] = mapid

    # Prepend an rsID column to the FUMA header
    head = inF.readline()
    outF.write("{}\t{}".format("rsID", head))

    # Write only the SNPs that have an rsID in the lookup; the rest are dropped
    for line in inF:
        fields = line.split()
        key = "{}:{}:{}".format(fields[1], fields[3], fields[4])
        if key in map_dic:
            outF.write("{}\t{}".format(map_dic[key], line))

In [None]:
/shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name python_rsID_mapping \
    --script_prefix rsID_mapping \
    --mem 3.8 \
    --priority 0 \
    --program python nicotine_ukbiobank_chrx_rsid_lookup.py 

In [None]:
head 20181210_20002_1439.gwas.imputed_v3.both_sexes_CHRX_maf_low-conf_filter_rsID_added.FUMA

```
rsID    Chr     Position        P-value Allele1 Allele2 Effect  StdErr
rs311165        23      2699555 5.76807e-02     C       A       -1.15436e-04    6.08161e-05
rs28579419      23      2699645 1.43253e-01     G       T       -1.15245e-04    7.87306e-05
rs60075487      23      2699676 5.70700e-01     G       A       -4.25255e-05    7.49979e-05
rs60233760      23      2699898 6.36532e-01     C       CT      -3.62646e-05    7.67420e-05
rs2306737       23      2699968 9.87747e-02     A       G       9.10008e-05     5.51244e-05
rs2306736       23      2700027 8.01447e-02     T       C       9.58006e-05     5.47478e-05
rs5939319       23      2700157 1.21295e-01     G       A       -1.20788e-04    7.79599e-05
rs5939320       23      2700202 2.90892e-01     A       G       -6.19511e-05    5.86564e-05
rs72619369      23      2700302 8.68957e-01     T       A       -1.24396e-05    7.53988e-05
```
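Because the script only writes SNPs whose key is found in the dictionary, any SNP without a match is absent from the output file. A quick way to see which ones were dropped is to compare the keys in the input and output files. The sketch below does this on in-memory lines; the helper names `snp_keys` and `unmatched_keys` and the second input row are hypothetical, introduced only for illustration.

```python
def snp_keys(lines, offset=0):
    """Yield "position:A1:A2" keys from FUMA rows, skipping the header.

    offset is the number of extra leading columns (1 once rsID is prepended).
    """
    rows = iter(lines)
    next(rows, None)  # skip the header row
    for row in rows:
        f = row.split()
        yield "{}:{}:{}".format(f[1 + offset], f[3 + offset], f[4 + offset])


def unmatched_keys(input_lines, output_lines):
    """Return keys present in the input but missing from the rsID-added output."""
    matched = set(snp_keys(output_lines, offset=1))
    return [key for key in snp_keys(input_lines) if key not in matched]


input_lines = [
    "Chr\tPosition\tP-value\tAllele1\tAllele2\tEffect\tStdErr",
    "23\t2699555\t5.76807e-02\tC\tA\t-1.15436e-04\t6.08161e-05",
    "23\t2699600\t1.0e-01\tG\tT\t1.0e-04\t1.0e-05",  # hypothetical unmatched row
]
output_lines = [
    "rsID\tChr\tPosition\tP-value\tAllele1\tAllele2\tEffect\tStdErr",
    "rs311165\t23\t2699555\t5.76807e-02\tC\tA\t-1.15436e-04\t6.08161e-05",
]

missing = unmatched_keys(input_lines, output_lines)
print(len(missing), missing)
```

Open file handles can be passed in place of the lists, since both helpers only iterate over lines.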