# UHS4 Phenotype File Construction: Map Genotype IDs to Phenotype IDs
**Author**: Jesse Marks

This notebook documents the steps taken to link the UHS4 genotype data to the phenotype data. The phenotype data have been compiled into a master file constructed by Bryan Quach. Linking genotype IDs to the phenotype file is not straightforward, so a custom Python module was developed to link the genotype IDs to their respective phenotypes. Each genotype ID is made up of several underscore-separated portions (e.g. AS00-00347_8002022294_HHG10078_12_H06). Some IDs can be mapped to the phenotype file by the cell line portion (e.g. HHG10078); others have to be mapped using the serum ID portion of the genotype ID (e.g. AS00-00347). The following script performs this search.
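
To make the ID layout concrete, the sketch below splits a genotype ID on underscores and pulls out the two portions used for matching. The function name `split_genotype_id` is hypothetical, and the field positions (serum ID first, cell line third) are assumed from the example ID above.

```python
def split_genotype_id(genotype_id):
    """Return the (serum ID, cell line) portions of a genotype ID.

    Assumes the underscore-delimited layout of the example ID:
    serum ID in the first field and cell line in the third field.
    """
    fields = genotype_id.strip().split("_")
    if len(fields) < 3:
        raise ValueError("Unexpected genotype ID format: {!r}".format(genotype_id))
    return fields[0], fields[2]


serum_id, cell_line = split_genotype_id("AS00-00347_8002022294_HHG10078_12_H06")
print(serum_id)   # AS00-00347
print(cell_line)  # HHG10078
```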

___


## Files Needed 
* master phenotype file
* genotype ID file

**master phenotype file** <br>
The master phenotype file contains the variables of interest to be used as phenotypes and covariates. This master phenotype file is located on S3 at:
`s3://rti-heroin/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv.gz`.
The updated CSV file uses quotes to demarcate column values.
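
Because the values are quoted, a field that contains a comma still parses as a single column. The sketch below illustrates this with Python's standard `csv` and `gzip` modules on a tiny made-up file. The column names and values are placeholders, not rows from the real phenotype file. `pandas.read_csv` treats double-quoted fields the same way by default.

```python
import csv
import gzip
import os
import tempfile

# Made-up rows for illustration only; the last row has a value containing a comma.
rows = [
    ["cell_line", "notes"],
    ["HHG00001", "no comma here"],
    ["HHG00002", "quoted, with a comma"],
]

path = os.path.join(tempfile.mkdtemp(), "toy_phenotypes.csv.gz")

# Write a gzip-compressed CSV with every value wrapped in double quotes.
with gzip.open(path, "wt", newline="") as handle:
    csv.writer(handle, quoting=csv.QUOTE_ALL).writerows(rows)

# Read it back; the quoted comma does not split the field.
with gzip.open(path, "rt", newline="") as handle:
    parsed = list(csv.reader(handle))

print(parsed[2])  # ['HHG00002', 'quoted, with a comma']
```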

**genotype ID file** <br>
This file contains only the genotype subject IDs that you are interested in. It is used to indicate which specific subjects need to be extracted from the phenotype file.
The first few lines, from `head ha.ids.8`:
```
AS00-00584_8002022306_HHG10079_12_A07
AS00-00631_8002022318_HHG10080_15_D03
AS00-00672_8002022247_HHG10082_11_E10
AS00-00675_8002022259_HHG10083_37_D08
AS00-01586_8002022283_HHG10085_13_A10
AS00-01589_8002022295_HHG10086_11_H07
```

## Variables to Specify 
These are the variables to specify in the `main` function.

* `phenotype_file`: The path to the master phenotype file
* `genotype_file`: The path to the genotype ID file
* `var_list`: A list of variables to pull from the phenotype file
* `date`: Today's date (prepended to the output file name)
* `out_suffix`: The suffix of the output file name (appended to today's date)
* `out_file`: The output file name, combining `date` and `out_suffix`
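
As a small illustration of how the output name is assembled, the sketch below joins the date and the suffix with a hyphen, which is how the `main` function builds `out_file`. The helper `build_out_file` is hypothetical, and generating the date with `datetime` is an optional convenience rather than something the script does.

```python
import datetime


def build_out_file(date, out_suffix):
    """Join the date and suffix with a hyphen to form the output file name."""
    return "{}-{}".format(date, out_suffix)


print(build_out_file("20190501", "uhs4-ea-fou-phenotype-file.txt"))
# 20190501-uhs4-ea-fou-phenotype-file.txt

# Optional: derive today's date in the same YYYYMMDD form instead of typing it.
today = datetime.date.today().strftime("%Y%m%d")
print(build_out_file(today, "uhs4-ea-fou-phenotype-file.txt"))
```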

In [6]:
import itertools
import os

import pandas as pd

# Work from the directory that holds the genotype ID file (machine-specific path)
os.chdir("C:/Users/jmarks/OneDrive - Research Triangle Institute/Projects/heroin/ngc/uhs4/merged2")

def main():

    phenotype_file = "../phenotype/unprocessed/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv"
    genotype_file = "uhs4.merged2.ea.genotype.ids" # geno ids to map to pheno
    var_list = ["age", "sex_selfreport", "totopioid_tot_30d"]
    date = "20190501"  # today's date
    out_suffix = "uhs4-ea-fou-phenotype-file.txt" # suffix of outfile
    out_file = "{}-{}".format(date, out_suffix)

    # Read every column as a string so ID values are compared as text
    df = pd.read_csv(phenotype_file, dtype=str)
    cell_d, serum_d, gwas_d = gp_map(phenotype_file, var_list, df)
    filter_ids(genotype_file, var_list, out_file, cell_d, serum_d, gwas_d)



################################################################################
# view head of dictionaries
def glance(d, size):
    return dict(itertools.islice(d.items(), size))

def gp_map(phenotype_file, var_list, df):
    # Initialize dictionaries that hold the variables of interest for every subject
    # in the phenotype file, keyed by cell line, serum ID, and GWAS serum ID
    cell_dic = {}
    serum_dic = {}
    gwas_dic = {}

    for row in range(len(df)):
        # Start a fresh list for each key (a repeated ID keeps only its last row)
        cell_dic[df.loc[row, "cell_line"]] = []
        serum_dic[df.loc[row, "serum"]] = []
        gwas_dic[df.loc[row, "gwasserum"]] = []

        for var in var_list:
            cell_dic[df.loc[row, "cell_line"]].append(df.loc[row, var])
            serum_dic[df.loc[row, "serum"]].append(df.loc[row, var])
            gwas_dic[df.loc[row, "gwasserum"]].append(df.loc[row, var])

    print("Finished creating the genotype to phenotype map.\n")
    return cell_dic, serum_dic, gwas_dic

def newline(dic, identifier, otherline, outF):
    """Write one output row: the genotype ID followed by its tab-separated phenotype values."""
    line = "\t".join(str(x) for x in dic[identifier])
    line = "{}\t{}".format(otherline.strip(), line)
    outF.write(line + "\n")

def filter_ids(genotype_file, var_list, out_file, cell_d, serum_d, gwas_d):
    with open(genotype_file) as inF, open(out_file, "w") as outF:
        field1 = "genotype_ID"
        head = "\t".join(var_list)
        head = "{}\t{}\n".format(field1, head)
        outF.write(head)
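        for line in inF:
            # Skip blank lines, which would otherwise fail when indexing the split fields
            if not line.strip():
                continue
            sl = line.split("_")
            asnum = sl[0]  # serum ID portion, e.g. AS00-00347
            hhg = sl[2]  # cell line portion, e.g. HHG10078

            # Try the cell line first, then the serum ID, then the GWAS serum ID
            if hhg in cell_d:
                newline(cell_d, hhg, line, outF)
            elif asnum in serum_d:
                newline(serum_d, asnum, line, outF)
            elif asnum in gwas_d:
                newline(gwas_d, asnum, line, outF)
            else:
                print("Didn't find the following ID:\n")
                print(line)
                print(hhg)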
    print("All done filtering down to the genotype IDs of interest jess.")
    print("your file is saved to: {}".format(out_file))


################################################################################
if __name__ == "__main__":
    main()


Finished creating the genotype to phenotype map.

All done filtering down to the genotype IDs of interest jess.
your file is saved to: 20190501-uhs4-ea-fou-phenotype-file.txt
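
The lookup order used in `filter_ids` can be illustrated on toy data: try the cell line first, then the serum ID against the serum dictionary, then against the GWAS serum dictionary. The dictionaries, IDs, and values below are made up, and `lookup_phenotypes` is a hypothetical helper rather than part of the script.

```python
def lookup_phenotypes(genotype_id, cell_d, serum_d, gwas_d):
    """Return (source, values) for the first dictionary that matches, else None."""
    fields = genotype_id.strip().split("_")
    asnum, hhg = fields[0], fields[2]
    if hhg in cell_d:
        return "cell_line", cell_d[hhg]
    if asnum in serum_d:
        return "serum", serum_d[asnum]
    if asnum in gwas_d:
        return "gwasserum", gwas_d[asnum]
    return None


# Toy dictionaries: the values are placeholder [age, sex_selfreport] entries.
cell_d = {"HHG00001": ["30", "1"]}
serum_d = {"AS00-00002": ["41", "2"]}
gwas_d = {"AS00-00003": ["52", "1"]}

ids = [
    "AS00-00001_8000000001_HHG00001_12_A01",  # matches on cell line
    "AS00-00002_8000000002_HHG00002_12_A02",  # falls back to serum ID
    "AS00-00003_8000000003_HHG00003_12_A03",  # falls back to GWAS serum ID
    "AS00-00004_8000000004_HHG00004_12_A04",  # no match
]

for genotype_id in ids:
    print(genotype_id, lookup_phenotypes(genotype_id, cell_d, serum_d, gwas_d))
```

Reporting which dictionary matched, as this sketch does, is one way to see how many IDs needed each fallback.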
