# Hi-C Opioids Genotype Data Processing
**Author:** Jesse Marks <br>
**GitHub Issue**: [#159](https://github.com/RTIInternational/bioinformatics/issues/159)<br>
**Initial sample size:** 155

The samples were genotyped on [Illumina Infinium Multi-Ethnic Global-8 v1.0](https://support.illumina.com/downloads/infinium-multi-ethnic-global-8-v1-product-files.html) genotyping arrays. The genotype data are provided in [Illumina TOP/BOT format](https://www.illumina.com/documents/products/technotes/technote_topbot.pdf) with A/B allele nomenclature.  

The genotype data are stored on S3 at `s3://rti-hic/shared_data/raw/cohort_0001/genotype/array/CaseWestern_Genotyping_Report_20201016.txt.gz`. 

[How to interpret DNA strand and allele information for Infinium genotyping array data](https://support.illumina.com/bulletins/2017/06/how-to-interpret-dna-strand-and-allele-information-for-infinium-.html)

# Pre-QC reformatting

Note that the genotype data are provided in Illumina TOP/BOT format with A/B nomenclature. We first extract the genotypes, then determine whether each variant is TOP or BOT and convert the A/B calls to A, C, G, or T. Finally, we convert the genotype data to PLINK format and orient the genotypes to the GRCh37 forward (+) strand.

In [None]:
## make directory structure
mkdir -p /shared/rti-hic/shared_data/raw/cohort_0001/genotype/array/info/
mkdir -p /shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/

## download manifest files for genotype array
cd /shared/rti-hic/shared_data/raw/cohort_0001/genotype/array/info
wget ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/productfiles/multiethnic-global-8/v1-0/infinium-multi-ethnic-global-8-d1-csv.zip

unzip infinium-multi-ethnic-global-8-d1-csv.zip

## download genotype data
cd /shared/rti-hic/shared_data/raw/cohort_0001/genotype/array/
aws s3 cp s3://rti-hic/shared_data/raw/cohort_0001/genotype/array/CaseWestern_Genotyping_Report_20201016.txt.gz .

## upload genotype QC report and sample sheet from local
#CaseWestern_Multi-Ethnic_QC_Report_20201016.xlsx
#CaseWestern_Sample-Sheet_20201016.csv.gz

## Extract genotypes (A/B) from raw data
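
The extraction cell for this step keeps the `Name`, `Chr`, and `Position` columns (indices 1, 3, and 4) plus every fourth column starting at index 10. As a quick check of that column arithmetic, the sketch below applies the same selection to a toy header. The toy header is an assumption for illustration only: the sample names and the non-GType column names (`Index`, `Address`, `Meta*`, `.Score`, `.Theta`, `.R`) are hypothetical and are not taken from the real report.

```python
def genotype_columns(fields):
    """Return every fourth field starting at index 10 (same selection as the extraction cell)."""
    return [fields[i + 6] for i in range(1, len(fields) - 6) if i % 4 == 0]


# Hypothetical header: 10 leading annotation columns, then 4 columns per sample.
leading = ["Index", "Name", "Address", "Chr", "Position"] + ["Meta{}".format(i) for i in range(5)]
samples = ["S1", "S2", "S3"]
per_sample = [s + suffix for s in samples for suffix in (".GType", ".Score", ".Theta", ".R")]
header = leading + per_sample

selected = genotype_columns(header)
print(selected)

# The list comprehension picks the same fields as a plain slice.
print(selected == header[10::4])
```

If the real report follows this layout, the slice `fields[10::4]` is an equivalent and shorter way to express the selection.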

In [None]:
### Python3
"""
Process the raw genotype data for cohort 0001.
Pull out the pertinent data:
Name    Chr     Position        *.GType
"""

import gzip

infile = "/shared/rti-hic/shared_data/raw/cohort_0001/genotype/array/CaseWestern_Genotyping_Report_20201016.txt.gz"
outfile = "/shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/cohort_0001_casewestern_genotypes_a_b.txt"

with gzip.open(infile, 'rt') as inF, open(outfile, 'w') as outF:
    head = inF.readline()
    split_head = head.split("\t")

    # indices to keep
    snp = 1
    chrom = 3
    pos = 4

    # keep only headers of interest
    keep_head = []
    a_list = [ split_head[snp], split_head[chrom], split_head[pos] ]
    # genotypes are listed every fourth column starting at the 10th index
    b_list = [ split_head[i+6] for i in range(1, len(split_head)-6) if (i % 4) == 0 ]
    keep_head.extend(a_list)
    keep_head.extend(b_list)
    keep_head = "\t".join(keep_head) + "\n"
    outF.write(keep_head)

    line = inF.readline()
    # keep only columns of interest
    while line:
        sl = line.split()
        keep_list = []

        a_list = [ sl[snp], sl[chrom], sl[pos] ]
        # genotypes are listed every fourth column starting at the 10th index
        b_list = [ sl[i+6] for i in range(1, len(sl)-6) if (i % 4) == 0 ]
        keep_list.extend(a_list)
        keep_list.extend(b_list)

        keep_line = "\t".join(keep_list) + "\n"
        outF.write(keep_line)
        line = inF.readline()

*Note*: 20,382 variants were lost from our genotype data when converting from the A/B nomenclature of the Illumina TOP/BOT format using the [manifest file](https://support.illumina.com/downloads/infinium-multi-ethnic-global-8-v1-product-files.html). These variants were either absent from the manifest file or were indels whose coding was ambiguous.

## Convert to ACTG & GRCh37 + strand

Using the [Manifest file (build 37)](https://support.illumina.com/downloads/infinium-multi-ethnic-global-8-v1-product-files.html) we determine which Illumina strand the alleles are oriented to (TOP/BOT) and then convert the A/B nomenclature to ACGT nomenclature. The alleles are also oriented to the GRCh37 + strand.
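
To make the conversion concrete, here is a minimal sketch of the idea for a single SNP. It assumes a manifest-style allele string such as `[T/C]` (first allele = A, second = B) and a GRCh37 strand flag of `+` or `-`. The function names `ab_to_acgt` and `translate_call` and the example values are hypothetical, chosen only for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def ab_to_acgt(snp_field, ref_strand):
    """Map allele A/B to nucleotides on the GRCh37 + strand.

    snp_field:  manifest-style allele string such as "[T/C]" (first = A, second = B)
    ref_strand: "+" or "-" (strand of the alleles relative to GRCh37)
    """
    allele_a, allele_b = snp_field.strip("[]").split("/")
    if ref_strand == "-":
        # alleles are reported on the - strand, so complement each base
        allele_a = "".join(COMPLEMENT[base] for base in allele_a)
        allele_b = "".join(COMPLEMENT[base] for base in allele_b)
    return {"A": allele_a, "B": allele_b}


def translate_call(call, mapping):
    """Translate an A/B call such as "AB" into "allele:allele"; "NC" (no call) becomes "0:0"."""
    if call == "NC":
        return "0:0"
    return "{}:{}".format(mapping[call[0]], mapping[call[1]])


mapping = ab_to_acgt("[T/C]", "-")
calls = [translate_call(c, mapping) for c in ["AA", "AB", "BB", "NC"]]
print(mapping)
print(calls)
```

For this toy variant the `-` strand alleles T/C become A/G on the + strand, so an `AB` call is written as `A:G`.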

In [None]:
### Python3
"""
Convert the Illumina TOP/BOT formatted A/B nomenclature to
ACGT coding using the manifest for that specific 
Illumina Infinium chip.
"""

import re

manifest = '/shared/rti-hic/shared_data/raw/cohort_0001/genotype/array/info/Multi-EthnicGlobal_D1.csv'
infile = '/shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/cohort_0001_casewestern_genotypes_a_b.txt'
outfile = '/shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/cohort_0001_casewestern_genotypes_acgt_plus_strand.txt'


# flip alleles if GRCh37 strand is "-"
def flip_alleles(allele):
    flipper = {
            "A": "T",
            "T": "A",
            "C": "G",
            "G": "C",
            }

    new_alleles = [flipper[i] for i in allele] 
    new_alleles = "".join(new_alleles)
    return new_alleles


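# determine which is allele A and B using Illumina TOP/BOT format
# input: one manifest line; returns a dict {"A": allele, "B": allele}
# with both alleles oriented to the GRCh37 + strand
def orientation(line):
    line = line.strip().split(",")  # strip the newline so the last field compares cleanly
    ill_strand = line[2]    # Illumina strand: TOP/BOT (SNPs) or PLUS/MINUS (indels)
    grch_strand = line[20]  # strand relative to GRCh37 ("+" or "-")
    alleles = line[3]       # bracketed allele pair, e.g. "[A/G]"
    markername = line[1]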

    a_b = re.sub(r'[\[\]]', '', alleles) # remove square brackets
    a_b = a_b.split('/')
    orient_dict = {}

    if ill_strand == "TOP" or ill_strand == "BOT":
        if grch_strand == "-":
            orient_dict["A"] = flip_alleles(a_b[0])
            orient_dict["B"] = flip_alleles(a_b[1])
            return orient_dict
        else:
            orient_dict["A"] = a_b[0]
            orient_dict["B"] = a_b[1]
            return orient_dict

    else: # else if the ill_strand is PLUS/MINUS (an indel)
        m_names = markername.split("-")
        m_names = m_names[1:]
        insertion = max(m_names, key=len) # longest string will be an insertion
        deletion = min(m_names, key=len)  # shortest string will be a deletion

        if a_b[0] == "I":
            a_b[0] = insertion
            a_b[1] = deletion
        else:
            a_b[1] = insertion
            a_b[0] = deletion

        if grch_strand == "-":
            orient_dict["A"] = flip_alleles(a_b[0])
            orient_dict["B"] = flip_alleles(a_b[1])
            return orient_dict
        else:
            orient_dict["A"] = a_b[0]
            orient_dict["B"] = a_b[1]
            return orient_dict


def convert_genotype(line, tmp_dict):
    line = line.split()
    snp = line[0]
    chrom = line[1]
    pos = line[2]
    genotypes = line[3:]

    for i in range(len(genotypes)):
        if genotypes[i] == "NC":
            genotypes[i] = '0:0' # (Per PLINK: '0' = no call)
        else:
            chrom_pos = "{}:{}".format(chrom, pos)
            if chrom_pos == "0:0":
                chrom_pos = snp
            genoA = tmp_dict[chrom_pos][genotypes[i][0]]
            genoB = tmp_dict[chrom_pos][genotypes[i][1]]

            sample = "{}:{}".format(genoA, genoB)
            genotypes[i] = sample
    new_list = line[0:3] + genotypes
    new_line = "\t".join(new_list) + "\n"
    return new_line

####################################################################################################
####################################################################################################

with open(manifest) as manF, open(infile) as inF, open(outfile, 'w') as outF:

    # skip comment lines in the header
    line = manF.readline()
    while (line[0:6] != 'IlmnID'):
        line = manF.readline()
    print(line)
        
    line = line.strip().split(',')
    markername = 1
    chrom = 9
    pos = 10

    # dictionary will contain allele A and B for the GRCh + strand
    man_dict = {}

    # run through whole manifest file
    for line in manF:
        sl = line.strip().split(',')
        try:
            a_b = orientation(line) # GRCh37 + strand alleles A & B
            name = "{}:{}".format(sl[chrom], sl[pos])
        except (KeyError, IndexError, ValueError):
            # skip lines that cannot be parsed (e.g. indels with ambiguous coding)
            continue

        if name == "0:0": # some of the variants had 0:0 for chrom:pos
            name = sl[markername] # just use the given variant name

        man_dict[name] = a_b
    

    # now run through our actual genotypes
    head = inF.readline()
    outF.write(head)
    for line in inF:
        try:
            out_line = convert_genotype(line, man_dict) # convert the A's and B's to the appropriate ACGT format
        except (KeyError, IndexError):
            continue  # some SNPs were not found in the manifest (e.g.  1:155204987)
        outF.write(out_line)
            

## tped & tfam
The data must be converted to tped and tfam before we can convert to PLINK's BED/BIM/FAM format.
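
As a reminder of the two layouts, the sketch below builds one tfam row and one tped row from toy values. A tfam row holds family ID, individual ID, paternal ID, maternal ID, sex, and phenotype; a tped row holds chromosome, variant ID, genetic distance, base-pair position, and then two alleles per sample. The helper names `tfam_row` and `tped_row`, the sample ID, and the variant are hypothetical and are used only for illustration.

```python
def tfam_row(sample_id):
    """One tfam row: FID, IID, paternal ID, maternal ID, sex, phenotype (0 = unknown/missing)."""
    return "\t".join([sample_id, sample_id, "0", "0", "0", "0"])


def tped_row(variant_id, chrom, pos, genotypes):
    """One tped row: chrom, variant ID, genetic distance (0 = unknown), position, two alleles per sample."""
    alleles = [allele for genotype in genotypes for allele in genotype.split(":")]
    return "\t".join([chrom, variant_id, "0", pos] + alleles)


fam_line = tfam_row("sample_1")
ped_line = tped_row("rs0000001", "1", "12345", ["A:G", "0:0", "G:G"])
print(fam_line)
print(ped_line)

# A tped row always has 4 leading columns plus 2 allele columns per sample.
print(len(ped_line.split("\t")) == 4 + 2 * 3)
```

Counting columns this way is a cheap sanity check before handing the files to PLINK: every tped row should have `4 + 2 * N` fields, where `N` is the number of tfam rows.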

In [None]:
### python3
"""
Convert file to a tped and a tfam file.
"""
import re

infile = '/shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/cohort_0001_casewestern_genotypes_acgt_plus_strand.txt'
outfile1 = '/shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/cohort_0001_casewestern_genotypes_acgt_plus_strand.tfam'
outfile2 = '/shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/cohort_0001_casewestern_genotypes_acgt_plus_strand.tped'

with open(infile) as inF, open(outfile1, 'w') as tFAM, open(outfile2, 'w') as tPED:
    head = inF.readline()
    head_ids = head.split()
    head_ids = head_ids[3:]

    # create tfam file
    for item in head_ids:
        item = re.sub(r'\.GType', '', item)
        tfam_line = '{}\t{}\t0\t0\t0\t0\n'.format(item, item)
        tFAM.write(tfam_line)


    # create tped file
    line = inF.readline()
    while line:
        sl = line.split()
        variant_id = sl[0]
        chrom = sl[1]
        pos = sl[2]
        genotypes = sl[3:]
        genotypes = [ re.sub(r':', '\t', item) for item in genotypes ]
        genotypes = '\t'.join(genotypes)
        outline = '{}\t{}\t0\t{}\t{}\n'.format(chrom, variant_id, pos, genotypes) 
        tPED.write(outline)
        line = inF.readline()


## Convert to BED/BIM/FAM

In [None]:
cd /shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001

/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --tfile cohort_0001_casewestern_genotypes_acgt_plus_strand \
    --make-bed \
    --out casewestern_cohort_0001_genotypes
    

## Add sex information
Add the sex information to the FAM file. It can be found in the file
`CaseWestern_Multi-Ethnic_QC_Report_20201016.xlsx`.
We will create a map file from this Excel file and then use it to add the sex information to the FAM file.

```
head CaseWestern_Multi-Ethnic_QC_Report_20201016_sexmap.txt

Sample_ID       Gender
8042579044_JN0000001_HCCCP      Male
8042579056_JN0000013_HCTVE      Male
8042579068_JN0000025_HCTQQ      Male
8042579080_JN0000037_NIH1889    Female
8042579092_JN0000049_HCTMB      Male
8042579104_JN0000061_HCTGD      Male
8042579116_JN0000073_HCTBA      Female
```
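
The mapping itself is small: PLINK expects `1` for male, `2` for female, and `0` for unknown in the fifth FAM column. The following sketch shows the lookup on a few lines in the same layout as the map file above. The function name `parse_sex_map` and the `toy_sample_unknown` row are hypothetical, and falling back to `0` for unrecognized or missing entries is an assumption of this sketch rather than something the map file requires.

```python
SEX_CODES = {"Male": "1", "Female": "2"}


def parse_sex_map(lines):
    """Build {sample_id: PLINK sex code} from 'Sample_ID<whitespace>Gender' lines (first line is the header)."""
    sex_by_sample = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        sex_by_sample[fields[0]] = SEX_CODES.get(fields[1], "0")  # 0 = unknown
    return sex_by_sample


map_lines = [
    "Sample_ID       Gender",
    "8042579044_JN0000001_HCCCP      Male",
    "8042579080_JN0000037_NIH1889    Female",
    "toy_sample_unknown      Unknown",  # hypothetical row, not in the real map file
]
sex_map = parse_sex_map(map_lines)

# Fill the sex column (index 4) of one FAM line.
fam_fields = "8042579080_JN0000037_NIH1889 8042579080_JN0000037_NIH1889 0 0 0 0".split()
fam_fields[4] = sex_map.get(fam_fields[0], "0")
print(" ".join(fam_fields))
```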

In [None]:
### Python3
"""
Use a map file to enter sex information into the FAM file
"""

famfile = 'casewestern_cohort_0001_genotypes.fam'
infile = 'CaseWestern_Multi-Ethnic_QC_Report_20201016_sexmap.txt'
outfile = 'casewestern_cohort_0001_genotypes_sex_mapped.fam'

with open(famfile) as famF, open(infile) as inF, open(outfile, 'w') as outF:
    head = inF.readline()

    # contains map to sex
    sex_dic = {}

    # populate the sex dictionary by looping through map file
    line = inF.readline()
    while line:
        sl = line.split()
        sex = sl[1]
        if sex == 'Male':
            sex = '1'
        elif sex == 'Female':
            sex = '2'
        else:
            sex = '0'  # PLINK code for unknown sex
        sex_dic[sl[0]] = sex
        line = inF.readline()

    # now loop through the fam file (positive controls and NA subjects are renamed here and removed later with PLINK)
    line = famF.readline()
    count = 0
    while line:
        sl = line.split()
        
        # we will remove these with PLINK
        # need to rename NA subjects because they throw an error in PLINK when trying to remove them 
        if sl[0] == 'Positive_Control_NA_NA' or sl[0] == 'NA':
            count += 1
            sl[0] += str(count)
            sl[1] += str(count)
            outline = ' '.join(sl) + '\n'
            outF.write(outline)
            line = famF.readline()
            continue
            
        sl[4] = sex_dic[sl[0]]
        new_line = ' '.join(sl) + '\n'
        outF.write(new_line)
        line = famF.readline()


## Remove positive controls
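
The cell for this step keeps the first 155 FAM rows, which relies on the controls and NA subjects sitting at the bottom of the file. As a hedged alternative that does not depend on row order, the sketch below selects family IDs by name instead. It assumes the renamed IDs produced in the previous step (`Positive_Control_NA_NA` or `NA` followed by a counter). The pattern and the function name `family_ids_to_keep` are hypothetical, and the two control rows are toy examples.

```python
import re

# Matches the renamed control/NA family IDs, e.g. "Positive_Control_NA_NA1" or "NA2".
CONTROL_PATTERN = re.compile(r"^(Positive_Control_NA_NA|NA)\d*$")


def family_ids_to_keep(fam_lines):
    """Return family IDs (first column) of FAM lines that are not controls or NA placeholders."""
    keep = []
    for line in fam_lines:
        fields = line.split()
        if fields and not CONTROL_PATTERN.match(fields[0]):
            keep.append(fields[0])
    return keep


fam_lines = [
    "8042579044_JN0000001_HCCCP 8042579044_JN0000001_HCCCP 0 0 1 0",
    "8042579080_JN0000037_NIH1889 8042579080_JN0000037_NIH1889 0 0 2 0",
    "Positive_Control_NA_NA1 Positive_Control_NA_NA1 0 0 0 0",
    "NA2 NA2 0 0 0 0",
]
keep_ids = family_ids_to_keep(fam_lines)
print("\n".join(keep_ids))
```

Writing `keep_ids` one per line gives a file in the same form as `keep_list.txt`, which `--keep-fam` reads as a list of family IDs.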

In [None]:
cd /shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001

# the bottom 5 subjects were either NA or positive controls
cut -d' ' -f1 casewestern_cohort_0001_genotypes_sex_mapped.fam | head -155 > keep_list.txt
    
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bed casewestern_cohort_0001_genotypes.bed \
    --bim casewestern_cohort_0001_genotypes.bim \
    --fam casewestern_cohort_0001_genotypes_sex_mapped.fam \
    --keep-fam keep_list.txt \
    --make-bed \
    --out casewestern_cohort_0001
    

## Upload to S3

In [None]:
cd /shared/rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001

for ext in {bed,bim,fam}; do
    gzip casewestern_cohort_0001.$ext 
    aws s3 cp  casewestern_cohort_0001.$ext.gz s3://rti-hic/shared_data/pre_qc/cohort_0001/genotype/array/observed/0001/
done


# QC

View processing [here](https://github.com/RTIInternational/bioinformatics/blob/master/methods/rti-hic/shared_data/post_qc/cohort_0001/genotype/array/observed/0001/post_qc_cohort_0001_genotype_array_0001.md).