# Map genotype ids to phenotype ids
Each genotype ID is made up of several underscore-separated portions (e.g. AS00-00347_8002022294_HHG10078_12_H06). Mapping these genotype IDs to the phenotype IDs is not straightforward. Some IDs can be mapped to the phenotype file by the cell_line portion (e.g. HHG10078), while others have to be mapped using the serum ID (e.g. AS00-00347). The following script does this searching for us: it tries the cell_line first, then the serum ID, then the GWAS serum ID.
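
As a minimal sketch of that lookup, the example below splits an ID on underscores and tries the cell_line portion before falling back to the serum ID. The helper names `split_genotype_id` and `find_key` and the toy ID sets are hypothetical. They stand in for lookups built from the phenotype file and are not part of the script itself.

```python
def split_genotype_id(genotype_id):
    """Return the (serum, cell_line) portions of an underscore-separated genotype ID."""
    parts = genotype_id.strip().split("_")
    return parts[0], parts[2]


def find_key(genotype_id, cell_ids, serum_ids):
    """Hypothetical helper: report which portion of the ID matches the phenotype data."""
    serum, cell_line = split_genotype_id(genotype_id)
    if cell_line in cell_ids:
        return "cell_line", cell_line
    if serum in serum_ids:
        return "serum", serum
    return None, None


# Toy ID sets (assumptions for illustration, not real phenotype data)
cell_ids = {"HHG10078"}
serum_ids = {"AS00-00351"}

match_a = find_key("AS00-00347_8002022294_HHG10078_12_H06", cell_ids, serum_ids)
match_b = find_key("AS00-00351_8002220319_HHG6146_36_C02", cell_ids, serum_ids)
print(match_a)  # ('cell_line', 'HHG10078')
print(match_b)  # ('serum', 'AS00-00351')
```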

___

## files needed
* master phenotype file
* master genotype ID file
* ancestry specific genotype ID file

**master phenotype file** <br>
The corrected master phenotype file contains the variables of interest to be used as phenotypes and covariates. This master phenotype file is located on S3 at:
`s3://rti-heroin/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv.gz`.
The updated CSV file uses quotes to demarcate column values.
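
Quoting matters when a value itself contains a comma. The sketch below uses Python's standard `csv` module on a made-up one-row sample. The column names appear in this notebook, but the values are invented for illustration. `pandas.read_csv` likewise treats double quotes as the quote character by default.

```python
import csv
import io

# Made-up sample in the same quoted style; the values are illustrative only
sample = io.StringIO(
    '"cell_line","serum","date_acq"\n'
    '"HHG10078","AS00-00347","Aug 28, 2017"\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]["date_acq"])  # Aug 28, 2017
```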


**master genotype ID file** <br>
The file `master.genotype.ids.n3469` contains the genotype IDs for all of the UHS4 subjects. It has three columns:
1. The AS number portion of the genotype ID
2. The HHG, or cell_line, portion of the genotype ID
3. The full genotype ID

`head -4 master.genotype.ids.n3469`
```
AS# HHG#    Full_ID
AS00-00347      HHG10078        AS00-00347_8002022294_HHG10078_12_H06
AS00-00351      HHG6146 AS00-00351_8002220319_HHG6146_36_C02
AS00-00437      HHG6150 AS00-00437_8002220343_HHG6150_36_D02
```
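
Because the first two columns are derived from the full ID, a quick consistency check is possible. This is a minimal sketch, assuming the file is whitespace-delimited as displayed. The function name `check_row` is hypothetical.

```python
def check_row(line):
    """Return True if columns 1 and 2 match the corresponding portions of the full ID."""
    as_num, hhg, full_id = line.split()
    parts = full_id.split("_")
    return parts[0] == as_num and parts[2] == hhg


# Rows copied from the head output above
rows = [
    "AS00-00347      HHG10078        AS00-00347_8002022294_HHG10078_12_H06",
    "AS00-00351      HHG6146 AS00-00351_8002220319_HHG6146_36_C02",
]
results = [check_row(r) for r in rows]
print(results)  # [True, True]
```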

**ancestry specific genotype ID file** <br>
This file contains only the genotype subject IDs that you are interested in. Use it to filter the mapping file down to an ancestry-specific phenotype file.

`head ha.ids.8`
```
AS00-00584_8002022306_HHG10079_12_A07
AS00-00631_8002022318_HHG10080_15_D03
AS00-00672_8002022247_HHG10082_11_E10
AS00-00675_8002022259_HHG10083_37_D08
AS00-01586_8002022283_HHG10085_13_A10
AS00-01589_8002022295_HHG10086_11_H07
```
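
The filtering step might look like the sketch below. It assumes the mapping file is tab-separated with a header line and the full genotype ID in the first column. The function name `filter_map` and the `pmmn` values are invented for illustration.

```python
def filter_map(map_lines, keep_ids):
    """Keep the header plus rows whose first tab-separated field is in keep_ids."""
    kept = []
    for i, line in enumerate(map_lines):
        first = line.rstrip("\n").split("\t")[0]
        if i == 0 or first in keep_ids:
            kept.append(line)
    return kept


# IDs of interest, e.g. read from the ancestry specific genotype ID file
keep_ids = {"AS00-00584_8002022306_HHG10079_12_A07"}

# Toy mapping lines (the pmmn values are made up)
map_lines = [
    "genotype_ID\tpmmn\n",
    "AS00-00584_8002022306_HHG10079_12_A07\t101\n",
    "AS00-00347_8002022294_HHG10078_12_H06\t102\n",
]
filtered = filter_map(map_lines, keep_ids)
print(len(filtered))  # 2
```

Holding the IDs in a set keeps each membership test constant-time, which helps when the mapping file has thousands of rows.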

In [None]:
import os, itertools
import pandas as pd

os.chdir("C:/Users/jmarks/OneDrive - Research Triangle Institute/Projects/heroin/ngc/uhs4/phenotype")
phenotype_file = "unprocessed/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv"
genotype_file = "processing/troubleshoot/all_genotype_ids.n3469"
var_list = ["pmmn", "personid", "presid", "date_acq", "gwasdate_acq", 
            "gwaspresid", "hivstat", "hiv_status", "wave.x", "wave.y", "ind_id"]
date = "20190328"
out_file = "processing/troubleshoot/{}_genotype_to_phenotype_map_with_variables3.txt".format(date)

df = pd.read_csv(phenotype_file, dtype=str) 

In [None]:
# view head of dictionaries
def glance(d, size):
    return dict(itertools.islice(d.items(), size))

def gp_map(phenotype_file, var_list, df):
    # initialize dictionaries that capture the variables-of-interest information for all subjects in phenotype file
    # keys are the cell_line, serum, and gwasserum IDs; each value is a list ordered like var_list
    cell_dic = {}
    serum_dic = {}
    gwas_dic = {}

    for row in range(len(df)):
        # initialize list for key-value in dictionary
        cell_dic[df.loc[row, "cell_line"]] = []
        serum_dic[df.loc[row, "serum"]] = []
        gwas_dic[df.loc[row, "gwasserum"]] = []

        for var in var_list:
            cell_dic[df.loc[row, "cell_line"]].append(df.loc[row, var])
            serum_dic[df.loc[row, "serum"]].append(df.loc[row, var])
            gwas_dic[df.loc[row, "gwasserum"]].append(df.loc[row, var])

    print("Finished creating the genotype to phenotype map.\n")
    return cell_dic, serum_dic, gwas_dic

In [None]:
def newline(dic, identifier, otherline, outF):
    line = "\t".join(str(x) for x in dic[identifier])
    line = "{}\t{}".format(otherline.strip(), line)
    outF.write(line + "\n")

def filter_ids(genotype_file, var_list, out_file, cell_d, serum_d, gwas_d):
    with open(genotype_file) as inF, open(out_file, "w") as outF:
        field1 = "genotype_ID"
        head = "\t".join(var_list)
        head = "{}\t{}\n".format(field1, head)
        outF.write(head)
        line = inF.readline()
        while line:
            sl = line.split("_")
            asnum = sl[0] 
            hhg = sl[2]

            if hhg in cell_d:
                newline(cell_d, hhg, line, outF)
            elif asnum in serum_d:
                newline(serum_d, asnum, line, outF)
            elif asnum in gwas_d:
                newline(gwas_d, asnum, line, outF)
            else:
                print("Didn't find the following ID:\n")
                print(line)
                print(hhg)
            line = inF.readline()
    print("All done filtering down to the genotype IDs of interest jess.")
    print("your file is saved to: %s" %  out_file)

In [7]:
import pandas as pd
import os, itertools
os.chdir("C:/Users/jmarks/OneDrive - Research Triangle Institute/Projects/heroin/ngc/uhs4/phenotype/processing/troubleshoot/")

from uhs_master_phenotype_parse import *
os.chdir("C:/Users/jmarks/OneDrive - Research Triangle Institute/Projects/heroin/ngc/uhs4/phenotype/")
#===============================================================================================================================

phenotype_file = "unprocessed/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv"
genotype_file = "processing/troubleshoot/all_genotype_ids.n3469" # geno ids to map to pheno
var_list = ["pmmn", "personid", "presid", "date_acq", "gwasdate_acq", 
            "gwaspresid", "hivstat", "hiv_status", "wave.x", "wave.y", "ind_id"]
date = "20190328"  # today's date
out_suffix = "_genotype_to_phenotype_map_with_variables.txt" # suffix of outfile
out_file = "processing/troubleshoot/{}{}".format(date, out_suffix)
#===============================================================================================================================

df = pd.read_csv(phenotype_file, dtype=str) 
cell_d, serum_d, gwas_d = gp_map(phenotype_file, var_list, df)
filter_ids(genotype_file, var_list, out_file, cell_d, serum_d, gwas_d)

Finished creating the genotype to phenotype map.

All done filtering down to the genotype IDs of interest jess.
your file is saved to: processing/troubleshoot/20190328_genotype_to_phenotype_map_with_variables.txt
