# Map genotype ids to phenotype ids
The individual IDs are made up of several portions (e.g. AS00-00347_8002022294_HHG10078_12_H06), so mapping these genotype IDs to the phenotype IDs is not straightforward. Some IDs can be mapped to the phenotype file by the cell_line portion (e.g. HHG10078), while others have to be mapped using the serum ID (e.g. AS00-00347). The following script does this searching for us.
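
As a minimal sketch of how the ID portions could be pulled apart, assuming the portions are always underscore-separated and appear in the order shown in the example ID (the helper name `split_genotype_id` is hypothetical and not part of the script below):

```python
def split_genotype_id(full_id):
    """Split a full genotype ID into the portions used for mapping.

    Assumes underscore-separated fields with the serum (AS) ID first and
    the cell_line (HHG) ID third, as in the example ID above.
    """
    parts = full_id.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected genotype ID format: {full_id!r}")
    return {"serum": parts[0], "cell_line": parts[2], "full_id": full_id}


example = split_genotype_id("AS00-00347_8002022294_HHG10078_12_H06")
print(example["serum"], example["cell_line"])
```

If some IDs deviate from this layout, the `ValueError` makes that visible early instead of producing a silently wrong key.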

___

## Files needed
* master phenotype file
* master genotype ID file
* ancestry specific genotype ID file

**master phenotype file** <br>
The corrected master phenotype file contains the variables of interest to be used as phenotypes and covariates. This master phenotype file is located on S3 at:
`s3://rti-heroin/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv.gz`.
The updated CSV file uses quotes to demarcate column values.
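
Because every value is quoted, a naive `split(",")` keeps the quote characters and would break on any value that contains a comma. The following is a minimal sketch of reading such a file with Python's standard `csv` module instead; the two-line sample and its values are hypothetical stand-ins for the real file, which has many more columns.

```python
import csv
import io

# Hypothetical excerpt of a quoted CSV; the values are placeholders.
sample = io.StringIO(
    '"serum","cell_line","age"\n'
    '"AS00-00347","HHG10078","45"\n'
)

reader = csv.reader(sample)  # the default quotechar is '"'
header = next(reader)        # quotes are removed by the reader
rows = list(reader)

age_index = header.index("age")  # no embedded quotes needed in the name
print(header, rows[0][age_index])
```

For the gzipped file itself, `gzip.open(path, "rt", newline="")` returns a text handle that can be passed to `csv.reader` in the same way.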


**master genotype file** <br>
The file `master.genotype.ids.n3469` contains the genotype IDs for all of the UHS4 subjects. It has three columns:
1. The AS number portion of the genotype ID
2. The HHG, or cell_line, portion of the genotype ID
3. The full genotype ID

`head -4 master.genotype.ids.n3469`
```
AS# HHG#    Full_ID
AS00-00347      HHG10078        AS00-00347_8002022294_HHG10078_12_H06
AS00-00351      HHG6146 AS00-00351_8002220319_HHG6146_36_C02
AS00-00437      HHG6150 AS00-00437_8002220343_HHG6150_36_D02
```
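
A minimal sketch of loading this three-column layout, assuming the columns are separated by runs of whitespace as in the excerpt (the function name `read_genotype_ids` is hypothetical, and an in-memory copy of the rows shown stands in for the real file):

```python
import io

# The first rows shown above, used here as a stand-in for the real file.
sample = io.StringIO(
    "AS# HHG#    Full_ID\n"
    "AS00-00347      HHG10078        AS00-00347_8002022294_HHG10078_12_H06\n"
    "AS00-00351      HHG6146 AS00-00351_8002220319_HHG6146_36_C02\n"
)


def read_genotype_ids(handle):
    """Return (serum, cell_line, full_id) tuples, skipping the header."""
    next(handle)  # header line
    records = []
    for line in handle:
        fields = line.split()  # any run of whitespace separates columns
        if len(fields) == 3:   # ignore blank or malformed lines
            records.append(tuple(fields))
    return records


records = read_genotype_ids(sample)
print(records[0])
```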

**ancestry specific genotype ID file** <br>
This file contains only the genotype subject IDs that you are interested in. It is used to filter the mapping file down to an ancestry-specific phenotype file.

`head ha.ids.8`
```
AS00-00584_8002022306_HHG10079_12_A07
AS00-00631_8002022318_HHG10080_15_D03
AS00-00672_8002022247_HHG10082_11_E10
AS00-00675_8002022259_HHG10083_37_D08
AS00-01586_8002022283_HHG10085_13_A10
AS00-01589_8002022295_HHG10086_11_H07
```
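
The filtering step can be sketched as follows. This is an illustration under stated assumptions rather than the script's exact logic: the function name `filter_by_ids` and the placeholder phenotype values are hypothetical, and IDs missing from the mapping are collected and returned instead of raising a `KeyError`.

```python
def filter_by_ids(map_rows, keep_ids):
    """Return the rows whose first field is in keep_ids, plus the IDs not found."""
    by_id = {row[0]: row for row in map_rows}
    kept = [by_id[i] for i in keep_ids if i in by_id]
    missing = [i for i in keep_ids if i not in by_id]
    return kept, missing


# Hypothetical mapping rows: (genotype_id, placeholder phenotype value).
map_rows = [
    ("AS00-00584_8002022306_HHG10079_12_A07", "x"),
    ("AS00-00347_8002022294_HHG10078_12_H06", "y"),
]
keep_ids = [
    "AS00-00584_8002022306_HHG10079_12_A07",
    "AS00-00631_8002022318_HHG10080_15_D03",
]

kept, missing = filter_by_ids(map_rows, keep_ids)
print(len(kept), missing)
```

Returning the missing IDs makes it easy to check whether every subject in the ancestry list was actually mapped.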

## Proportion of HA subjects with HIV
We need to know what proportion of HA subjects have HIV, both among the subjects classified as HA by the STRUCTURE analysis with the standard 25% threshold and among those classified with the 8% threshold.

We will need a list of those subjects. We created these two lists and copied them to our local machine.

In [None]:
cut -f2 ha_filtered.0.25 | xargs -I{} grep {} genotype.to.phenotype.map2 > genotype.to.phenotype.ha.0.25 &
cut -f2 ha_filtered.0.08 | xargs -I{} grep {} genotype.to.phenotype.map2 > genotype.to.phenotype.ha.0.08 &

In [155]:
### python ###
import itertools, os
"""
This script parses the master phenotype file and maps the genotype IDs to the
corresponding phenotype IDs. It was developed because there is no straightforward
way to map the genotype IDs to the phenotype file.
"""

base_dir = "/Users/jmarks/OneDrive - Research Triangle Institute/Projects/heroin/ngc/uhs4/phenotype"
date = "20190321" # current date; this will be prepended to the output-file-name
ancestry_file = "ha.ids.8"
match_var = '"viralload_log10.y"' # phenotype variable of interest in the master phenotype file

# file which contains only the subject genotype IDs of the subjects that were classified
# as HA after the STRUCTURE analysis
filter_file = "{}/processing/{}".format(base_dir, ancestry_file) 

# create the header for the output file
out_head = "{}\t{}\t{}\t{}\t{}\t{}".format("genotype_id", "phenotype_column", "ancestry_selfreport", match_var.strip('\"'), "age", "sex_selfreport")
gen = "{}/processing/master.genotype.ids.n3469".format(base_dir)
phen = "{}/unprocessed/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv".format(base_dir)
out_file = "{}/processing/{}.genotype.to.phenotype.ancestry.{}.ha_8percent.map".format(base_dir, date, match_var.strip('\"'))

#================================================================================

def main():
    print("Hello, Jesse!\n")
    gen_to_phen(match_var)
    ancestry_filter(filter_file, out_file)
    print("\nAll done, Jess!")

# view head of dictionaries
def glance(d, size):
    return dict(itertools.islice(d.items(), size))

# function that maps the genotype IDs to the phenotype variables
def gen_to_phen(match_var):
    with open(gen) as asF, open(phen) as pF:
        phead = pF.readline()
        phead = phead.split(",")
        
        # note that all the cells in the phenotype file have quotes around the entries
        # which is why we have to use double quotes for the following vars of interest
        serum_index = phead.index('"serum"')
        cell_line_index = phead.index('"cell_line"')
        gwas_index = phead.index('"gwasserum"')
        ancestry_index = phead.index('"ancestry_selfreport"')
        age = phead.index('"age"')
        sex = phead.index('"sex_selfreport"')
        hiv_index = phead.index(match_var) 

        # initialized dictionaries that capture the variables-of-interest information 
        # for all subjects in phenotype file
        cell_dic = {}
        serum_dic = {}
        gwas_dic = {}
        
        line = pF.readline() 
        while line: # parse each line of the master phenotype file
            sl = line.split(",")
            
            # creating three mapping dictionaries because we are ultimately not sure
            # which ID variable we will have to use to map the genotype ID to the corresponding
            # phenotype information. It is actually going to take a combination of all three.
            cell_dic[sl[cell_line_index]] = (phead[cell_line_index], sl[ancestry_index],
                                             sl[hiv_index], sl[age], sl[sex])
            serum_dic[sl[serum_index]] = (phead[serum_index], sl[ancestry_index], 
                                          sl[hiv_index], sl[age], sl[sex])
            gwas_dic[sl[gwas_index]] = (phead[gwas_index], sl[ancestry_index],
                                        sl[hiv_index], sl[age], sl[sex])
            line = pF.readline()
        #print(glance(cell_dic, 3))

        keep_list = []
        next(asF) # skip header line
        sline = asF.readline()
        while sline:
            spl = sline.split()
            spl = [f'"{word}"' for word in spl]
            if spl[1] in cell_dic:
                tmptup = (spl[2], cell_dic[spl[1]])
                keep_list.append(tmptup)
            elif spl[0] in serum_dic:
                tmptup = (spl[2], serum_dic[spl[0]])
                keep_list.append(tmptup)
            elif spl[0] in gwas_dic:
                tmptup = (spl[2], gwas_dic[spl[0]])
                keep_list.append(tmptup)
            else:
                print("Didn't find the following ID:\n")
                print(spl[2])
            sline = asF.readline()

        with open(out_file, 'w') as outF:
            outF.write(out_head + "\n")
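            # each entry is (genotype_id, (phenotype_column, ancestry, match_var, age, sex));
            # flatten it so every row matches the six columns in out_head
            for gid, info in keep_list:
                line = "\t".join(str(i).strip().strip('\"') for i in (gid, *info))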
                outF.write(line + "\n")

# filter the map file that was created above to only the subjects classified
# as HA after the STRUCTURE analysis.
def ancestry_filter(filter_file, map_file):
    out_file2 = "{}.ha_only".format(map_file)
    with open(filter_file) as inF, open(map_file) as mF, open(out_file2, "w") as outF:
        head = mF.readline()
        outF.write(head)
        data_dic = {}
        line = mF.readline()
        while line:
            sl = line.split()
            data_dic[sl[0]] = line
            line = mF.readline()

        line = inF.readline()
        while line:
            sl = line.strip()
            outF.write(data_dic[sl])
            line = inF.readline()

#================================================================================

if __name__ == "__main__":
    main()

Hello, Jesse!


All done, Jess!
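
The lookup order used in `gen_to_phen` (cell_line first, then serum, then gwasserum) can be isolated into a small helper. The following is a minimal sketch, assuming dictionaries keyed by unquoted ID values; the function name `lookup_phenotype` and the placeholder dictionary contents are hypothetical.

```python
def lookup_phenotype(serum_id, cell_line_id, cell_dic, serum_dic, gwas_dic):
    """Try cell_line first, then serum, then gwasserum; return None if all miss."""
    if cell_line_id in cell_dic:
        return cell_dic[cell_line_id]
    for table in (serum_dic, gwas_dic):
        if serum_id in table:
            return table[serum_id]
    return None


# Hypothetical dictionaries; the tuple contents are placeholders.
cell_dic = {"HHG10078": ("cell_line", "row-A")}
serum_dic = {"AS00-00351": ("serum", "row-B")}
gwas_dic = {"AS00-00437": ("gwasserum", "row-C")}

hit_cell = lookup_phenotype("AS00-00347", "HHG10078", cell_dic, serum_dic, gwas_dic)
hit_serum = lookup_phenotype("AS00-00351", "HHG6146", cell_dic, serum_dic, gwas_dic)
hit_gwas = lookup_phenotype("AS00-00437", "HHG6150", cell_dic, serum_dic, gwas_dic)
no_hit = lookup_phenotype("AS99-99999", "HHG0", cell_dic, serum_dic, gwas_dic)
print(hit_cell, hit_serum, hit_gwas, no_hit)
```

Keeping the name of the matching column in each tuple records which ID variable produced the match, which is what the `phenotype_column` field of the output file is for.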


# Sandbox

## Variable summary
### 25%

In [None]:
# number of HA classified subjects being HIV cases using 25% threshold for the ancestry cutoff
awk '$4==1' 20190313.genotype.to.phenotype.ancestry.hiv.ha_25percent.map.ha_only| ww
"""23"""

# number of HA classified subjects being HIV cases using 25% threshold for the ancestry cutoff
awk '$4==1' 20190313.genotype.to.phenotype.ancestry.hivstat.ha_25percent.map.ha_only| ww
"""23"""

# number of HA classified subjects being HIV cases using 25% threshold for the ancestry cutoff
awk '$4==1' 20190313.genotype.to.phenotype.ancestry.hiv_status.ha_25percent.map.ha_only| ww
"""23"""

# number of HA classified subjects being HIV cases using 25% threshold for the ancestry cutoff
awk '$4==1' 20190313.genotype.to.phenotype.ancestry.gwashiv.ha_25percent.map.ha_only| ww
"""23"""

# number of HA classified subjects with viral load using 25% threshold for the ancestry cutoff
awk '$4!=-9' 20190313.genotype.to.phenotype.ancestry.viralload_cperml.y.ha_25percent.map.ha_only| ww
"""23"""

# number of HA classified subjects with viral load using 25% threshold for the ancestry cutoff
awk '$4!~"NA"' 20190313.genotype.to.phenotype.ancestry.viralload_cperml.x.ha_25percent.map.ha_only | ww
"""23"""

### 8%

In [None]:
# number of HA classified subjects being HIV cases using 8% threshold for the ancestry cutoff
awk '$4==1' 20190313.genotype.to.phenotype.ancestry.hiv.ha_8percent.map.ha_only| ww
#"""71"""

# number of HA classified subjects being HIV cases using 8% threshold for the ancestry cutoff
awk '$4==1' 20190313.genotype.to.phenotype.ancestry.hivstat.ha_8percent.map.ha_only| ww
#"""71"""

# number of HA classified subjects being HIV cases using 8% threshold for the ancestry cutoff
awk '$4==1' 20190313.genotype.to.phenotype.ancestry.hiv_status.ha_8percent.map.ha_only| ww
#"""71"""

# number of HA classified subjects being HIV cases using 8% threshold for the ancestry cutoff
awk '$4~1' 20190313.genotype.to.phenotype.ancestry.gwashiv.ha_8percent.map.ha_only| ww
#"""0"""

# number of HA classified subjects with viral load using 8% threshold for the ancestry cutoff
awk '$4!=-9' 20190313.genotype.to.phenotype.ancestry.viralload_cperml.y.ha_8percent.map.ha_only| ww
#"""66"""

# number of HA classified subjects with viral load using 8% threshold for the ancestry cutoff
awk '$4!~"NA"' 20190313.genotype.to.phenotype.ancestry.viralload_cperml.x.ha_8percent.map.ha_only | ww
#"""70"""

In [73]:
## BASH ##
awk '$4!="-9"' 20190320.genotype.to.phenotype.ancestry.viralload_log10.y.ha_8percent.map.ha_only >\
    20190320.genotype.to.phenotype.ancestry.viralload_log10.y.ha_8percent.map.ha_only.complete
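
The awk one-liners above all select rows by the value in the fourth column. A Python equivalent is sketched below, with hypothetical sample rows and a hypothetical function name `select_rows`. One caveat worth checking: awk patterns such as `$4!=-9` or `$4!~"NA"` also match the header line that the mapping script writes, so a line count taken from their output may include the header. The sketch separates the header explicitly.

```python
import io


def select_rows(handle, keep, column=3):
    """Return (header, rows), where rows are the data lines whose
    whitespace-split field at index `column` satisfies keep()."""
    header = next(handle)
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) > column and keep(fields[column]):
            rows.append(line)
    return header, rows


# Hypothetical map file contents; IDs and values are placeholders.
sample_text = (
    "genotype_id\tphenotype_column\tancestry_selfreport\tviralload\tage\tsex_selfreport\n"
    "id_a\tcell_line\tHA\t3.2\t40\t1\n"
    "id_b\tserum\tHA\t-9\t35\t2\n"
    "id_c\tserum\tHA\t4.1\t52\t1\n"
)

header, complete = select_rows(io.StringIO(sample_text), lambda v: v != "-9")
print(len(complete))
```

Writing `header` followed by `complete` to a new file reproduces the complete-case filter, while `len(complete)` gives the count without the header.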

'viralload_cperml.y'
