# Data processing for dbGaP phs000428.v2.p2.c1 (retirement)
Genetics Resource with the Health and Retirement Study (phs000428.v2.p2)

**Author:** Jesse Marks

This document logs several components of data processing for [dbGaP study phs000428.v2.p2.c1](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000428.v2.p2) including 

* Data retrieval
* Genotype data quality control.

The purpose of processing these data is to prepare them for downstream steps such as haplotype phasing, imputation, and genome-wide association analysis.

## Software and tools
The software and tools used for processing these data are
* Windows 10 with [Cygwin](https://cygwin.com/) installed 
* [Aspera Connect](http://downloads.asperasoft.com/downloads)
* [KING](http://people.virginia.edu/~wc9c/KING/)
* [PLINK v1.9 beta 3.45](https://www.cog-genomics.org/plink/) 
* [SRA toolkit](https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/)
* [STRUCTURE](https://web.stanford.edu/group/pritchardlab/structure.html)
* [R v3.2.3](https://www.r-project.org/)
* [iGraph (R package)](http://igraph.org/r/)

## Data retrieval
### Genotypes and phenotypes
The genotype and phenotype data were downloaded to a local machine with the Aspera Connect browser plug-in and then transferred to Amazon Simple Storage Service (S3). These data require authorized access, so the [authorized access portal](https://dbgap.ncbi.nlm.nih.gov/dbgap/aa/wga.cgi?page=login) must be used (request login information from Eric Johnson). The files downloaded from dbGaP are encrypted and must be decrypted using `vdb-decrypt` from the SRA toolkit [(instructions here)](https://www.ncbi.nlm.nih.gov/books/NBK63512/#_Download_Points_often_Ignored_When_Decry_). Most of the data were not decrypted locally and will thus need to be decrypted on Amazon Elastic Compute Cloud (EC2).

Note: As a way to conserve space, certain genotype data for a given study are excluded from download. General criteria are:
* Exclude imputed data
* Exclude individual format genotype calls if the matrix and/or PLINK binary fileset format is available
* Exclude index files that lay out the directory structure for the individual format genotype calls
* Exclude raw array data if genotype calls are available

To assess which files may be unnecessary for download, examine the study report available through the public FTP download site (accessible via the dbGaP landing page for a given study accession).

### General directory structure setup

After downloading locally, all of the dbGaP data should be organized within a directory called `ncbi`. Each cohort's data should be placed within a subdirectory named `dbGaP-x`, where x is the project number of the download (not to be confused with the download request number). The project number can be found on the "Downloads" tab of the dbGaP authorized access portal. That tab also links to the repository key file, which should be placed in the top level of the project folder and then imported using [vdb-config](https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std). Although each download request provides a link to download the repository key, only one key file is needed per project.

**Note:** File names longer than the allowed Windows file name character limit will not decrypt and must be renamed before decrypting. It is highly recommended to check the file names after decryption to ensure that they successfully decrypted.
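
The file-name check described in this note can be scripted. The sketch below is a minimal example on toy files, not part of the original workflow. It assumes a 260-character path limit (the classic Windows `MAX_PATH`; pass a different value if your setup differs) and the `ncbi_enc` suffix that encrypted dbGaP files carry. The function names and the demo directory are hypothetical.

```shell
#!/usr/bin/env bash
# List files whose full path is longer than max_len characters.
# max_len defaults to 260 (assumption: the classic Windows MAX_PATH limit).
list_long_paths() {
    local dir=$1 max_len=${2:-260}
    find "$dir" -type f | awk -v max="$max_len" 'length($0) > max'
}

# List files that still carry the encrypted suffix after vdb-decrypt has run.
list_still_encrypted() {
    local dir=$1
    find "$dir" -type f -name '*.ncbi_enc'
}

# Toy demonstration in a temporary directory (made-up file names).
demo_dir=$(mktemp -d)
limit=$(( ${#demo_dir} + 20 ))   # small limit so the demo triggers on one file
touch "$demo_dir/ok.txt" \
      "$demo_dir/a_very_long_file_name_that_exceeds_the_limit.txt" \
      "$demo_dir/left.txt.ncbi_enc"

long_count=$(list_long_paths "$demo_dir" "$limit" | wc -l | tr -d ' ')
enc_count=$(list_still_encrypted "$demo_dir" | wc -l | tr -d ' ')
echo "paths over limit: $long_count; still encrypted: $enc_count"
```

Running `list_long_paths` on the download directory before decryption shows which files need renaming; running `list_still_encrypted` afterwards shows which files did not decrypt.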

#### S3 directory structure
    rti-common
    |
    |-- dbgap_studies
    |    |
    |     ` -- <dbGaP study name>
    |    |    |
    |    |     ` -- meta
    |    |    |
    |    |     ` -- genotype
    |    |    |    |    
    |    |    |     ` -- original
    |    |    |    |     |    
    |    |    |    |      ` -- unprocessed
    |    |    |    |     |    
    |    |    |    |      ` -- processing
    |    |    |    |     |    
    |    |    |    |      ` -- final
    |    |    |    |
    |    |    |     ` -- imputed
    |    |    |
    |    |     ` -- phenotype
    |    |    |    |
    |    |    |     ` -- unprocessed
    |    |    |    |
    |    |    |     ` -- processing
    |    |    |    |
    |    |    |     ` -- final 

###  Install Amazon Web Services Command Line Interface (AWS CLI) 
The Amazon Web Services Command Line Interface (AWS CLI) needs to be installed in order to upload the data from a local machine to S3.

In [None]:
# Install pip, a package manager for python applications on local machine
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
python get-pip.py

# Install awscli via pip
pip install awscli

# Verify installation; the output should resemble the commented string below
aws --version
# aws-cli/1.11.178 Python/2.7.13 CYGWIN_NT-10.0/2.9.0(0.318/5/3) botocore/1.7.36

### Configure AWS 
The settings for using the AWS CLI need to be configured before interacting with AWS. These configurations include security credentials and the default region. For more information on this process, see [here](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html).

**Note:** these only need to be configured once. 

In [None]:
# Local machine (Cygwin)
aws configure

AWS Access Key ID [None]: <your access key ID>
AWS Secret Access Key [None]: <your secret access key>
Default region name [None]: us-east-1
Default output format [None]: text  # could be json, text, or table

### Genetics Resource with the Health and Retirement Study (phs000428.v2.p2)
Genetics Resource with the Health and Retirement Study 
[dbGaP study phs000428.v2.p2.c1](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000428.v2.p2)
* Download 58712

In [None]:
## Local machine (Cygwin) ##

# Decrypt (must be executed within project directory)
cd /cygdrive/c/Users/jmarks/ncbi/dbGaP-2556
../../Desktop/sratoolkit/sratoolkit.2.8.2-1-win64/bin/vdb-decrypt.exe -v 58712

# Check for successful decryption
if [ "$(find 58712/ -name '*ncbi_enc' -print | wc -l)" -eq 0 ]; then echo "Success!"; else echo "Failed!"; fi

# Create directory structure
mkdir -p s3/phs000428_retirement/{meta,genotype,phenotype}
mkdir -p s3/phs000428_retirement/genotype/original/{final,unprocessed,processing}
mkdir s3/phs000428_retirement/phenotype/{final,unprocessed,processing}

# Move files to directory structure
cd /cygdrive/c/Users/jmarks/ncbi/dbGaP-2556/58712/PhenoGenotypeFiles/RootStudyConsentSet_phs000428.CIDR_Aging_Omni1.v2.p2.c1.NPR
mv GenotypeFiles/* /cygdrive/c/Users/jmarks/ncbi/dbGaP-2556/s3/phs000428_retirement/genotype/original/unprocessed/
mv PhenotypeFiles/* /cygdrive/c/Users/jmarks/ncbi/dbGaP-2556/s3/phs000428_retirement/phenotype/unprocessed/
mv StudyMetaFiles/* /cygdrive/c/Users/jmarks/ncbi/dbGaP-2556/s3/phs000428_retirement/meta/

# Upload to S3
cd /cygdrive/c/Users/jmarks/ncbi/dbGaP-2556/s3/
aws s3 mv phs000428_retirement/ s3://rti-common/dbGaP/phs000428_retirement/ --recursive

## Genotype processing

### Connecting to AWS - using Cygwin
To connect to AWS using the SSH client in Cygwin, you will need to use PuTTYgen on Windows to generate SSH key pairs. The two links [here](https://stackoverflow.com/questions/2224066/how-to-convert-ssh-keypairs-generated-using-puttygenwindows-into-key-pairs-use) and [here](https://stackoverflow.com/questions/2419566/best-way-to-use-multiple-ssh-private-keys-on-one-client) explain how to convert the PuTTY key into key pairs and then how to conveniently log in to AWS. The private key from the PuTTYgen output is saved to:

```
~/.ssh/gwas_rsa
```
This private key is then referenced in the SSH config file ```~/.ssh/config```. In the config file, I use the short name ```AWS``` so that I can log in with

```
ssh AWS
```

**Note:**  `User` for the config file is ```jmarks``` - in my case - and ```HostName``` is the IP address ```50.19.195.254```.
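
As an illustration of the setup described above, the sketch below writes a matching host entry to a scratch file rather than to `~/.ssh/config`, so nothing on the machine is modified. The `Host`, `User`, `HostName`, and key path come from the text; the scratch file and the variable name are hypothetical.

```shell
#!/usr/bin/env bash
# Write an example host entry to a scratch file (hypothetical location)
# instead of editing ~/.ssh/config directly.
ssh_config_demo=$(mktemp)
cat > "$ssh_config_demo" <<'EOF'
# "AWS" is the short name used with `ssh AWS`
Host AWS
    HostName 50.19.195.254
    User jmarks
    IdentityFile ~/.ssh/gwas_rsa
EOF

# Show the entry; copy it into ~/.ssh/config by hand once it looks right.
cat "$ssh_config_demo"
```

To try the entry without touching the real config, `ssh -F "$ssh_config_demo" AWS` points `ssh` at the scratch file.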

The cluster configuration was adjusted as follows:
* Changed the `maintain_initial_size` parameter to ```true```.
* `ebs_snapshot_id = snap-06d1702c532ba16c7`
* `volume_type = gp2`

**Note:** the Perl scripts will have a slightly different path than the ones that were on MIDAS.
For other software, such as R, the paths will probably be the same but the names slightly altered.

### Creating an instance on EC2

In [None]:
# local machine
ssh AWS

# cluster config server
cfncluster create 428retirement

# Output:"MasterPublicIP"="34.225.135.213"
# Output:"MasterPrivateIP"="172.31.16.41"
# Output:"GangliaPublicURL"="http://34.225.135.213/ganglia/"
# Output:"GangliaPrivateURL"="http://172.31.16.41/ganglia/"

# local machine, note that username is ec2-user
ssh -i ~/.ssh/gwas_rsa ec2-user@34.225.135.213

### Retrieving data from S3
To avoid writing over the current directory structure, I create a new directory within the directory structure of the other studies.

In [None]:
## EC2 ##
# Retrieve data from S3 and store in EC2
cd /shared/data/studies
aws s3 cp s3://rti-common/dbGaP/phs000428_retirement/ phs000428_retirement --recursive

# create directory structure
cd /shared/data/studies/phs000428_retirement/
mkdir -p phenotype/{final,processing}
mkdir -p genotype/original/{final,processing}

# copy and decompress phenotype data
cp phenotype/unprocessed/*.gz phenotype/processing
gunzip phenotype/processing/*

# copy and decompress genotype data
cp genotype/original/unprocessed/* genotype/original/processing
gunzip -r genotype/original/processing/*.gz

#### Retrieve necessary files

In [None]:
## Local Machine ##

# Retrieve file from MIDAS
scp jmarks@rtplhpc01.rti.ns:/share/nas03/bioinformatics_group/common/build_conversion/\
b37/dbsnp_b138/uniquely_mapped_snps.chromosomes .

# Upload to EC2
scp -i ~/.ssh/gwas_rsa uniquely_mapped_snps.chromosomes ec2-user@34.225.135.213:\
/shared/bioinformatics/methods/nas03/bioinformatics_group/common/build_conversion/b37/dbsnp_b138/



# Retrieve file from MIDAS
scp jmarks@rtplhpc01.rti.ns:/share/nas03/bioinformatics_group/common/build_conversion/\
b37/dbsnp_b138/uniquely_mapped_snps.positions .

# Upload to EC2
scp -i ~/.ssh/gwas_rsa uniquely_mapped_snps.positions ec2-user@34.225.135.213:\
/shared/bioinformatics/methods/nas03/bioinformatics_group/common/build_conversion/b37/dbsnp_b138/



# Retrieve file from MIDAS
scp jmarks@rtplhpc01.rti.ns:/share/nas03/bioinformatics_group/common/build_conversion/\
b37/dbsnp_b138/uniquely_mapped_snps.ids .

# Upload to EC2
scp -i ~/.ssh/gwas_rsa uniquely_mapped_snps.ids ec2-user@34.225.135.213:\
/shared/bioinformatics/methods/nas03/bioinformatics_group/common/build_conversion/b37/dbsnp_b138/



# Retrieve file from MIDAS
scp jmarks@rtplhpc01.rti.ns:/share/nas03/bioinformatics_group/common/snp_id_conversion/\
b138/old_to_current.xref .

# Upload to EC2
scp -i ~/.ssh/gwas_rsa old_to_current.xref ec2-user@34.225.135.213:\
/shared/bioinformatics/methods/nas03/bioinformatics_group/common/snp_id_conversion/b138/



# Retrieve files from MIDAS into a scratch directory
mkdir -p temp1000
for chr in {1..22};
do
scp jmarks@rtplhpc01.rti.ns:/share/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.gz temp1000/
done

# Upload to EC2
scp -i ~/.ssh/gwas_rsa temp1000/* ec2-user@34.225.135.213:\
/shared/bioinformatics/methods/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/

### Quality Control Sample Tracking
#### Pre-chromosome type partitioning
The table below provides statistics on variants and subjects filtered during each step of the QC process.

| QC procedure                         | Variants removed | Variants retained | Subjects removed | Subjects retained |
|--------------------------------------|------------------|-------------------|------------------|-------------------|
| Initial dbGaP dataset                | 0                | 2,315,518         | 0                | 2,559             |
| Genome build 37 and dbSNP 138 update | 200,149          | 2,115,369         | 0                | 2,559             |

#### Autosome statistics
This table includes filtering statistics prior to merging with chrX.

| QC procedure                                    | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
|-------------------------------------------------|------------------|-------------------|------------------|----------------|-------------------|
| Pre-partitioning w/initial procedures (all chr) | 6,920            | 963,422           | 0                | 0              | 1,170             |
| Partitioning to only autosomes                  | 27,356           | 936,066           | 0                | 0              | 1,170             |
| Remove subjects missing whole autosome data     | 0                | 936,066           | 0                | 0              | 1,170             |
| Duplicate rsID filtering                        | 0                | 936,066           | 0                | 0              | 1,170             |
| Remove ancestral outliers                       | 0                | 936,066           | 6                | 0              | 1,164             |
| Remove subjects with re-assigned ancestry       | 0                | 936,066           | 39               | 0              | 1,125             |
| Add subjects re-assigned by STRUCTURE           | 0                | 936,066           | 0                | 62             | 1,187             |
| Remove variants with missing call rate > 3%     | 78,222           | 857,844           | 0                | 0              | 1,187             |
| Remove variants with HWE p < 0.0001             | 1,663            | 856,181           | 0                | 0              | 1,187             |

#### ChrX statistics
This table includes filtering statistics prior to merging with autosomes.

| QC procedure                                    | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
|-------------------------------------------------|------------------|-------------------|------------------|----------------|-------------------|
| Pre-partitioning w/initial procedures (all chr) | 6,920            | 963,422           | 0                | 0              | 1,170             |
| Partitioning to only autosomes                  | 27,356           | 936,066           | 0                | 0              | 1,170             |
| Remove subjects missing whole autosome data     | 0                | 936,066           | 0                | 0              | 1,170             |
| Duplicate rsID filtering                        | 0                | 936,066           | 0                | 0              | 1,170             |
| Remove ancestral outliers                       | 0                | 936,066           | 6                | 0              | 1,164             |
| Remove subjects with re-assigned ancestry       | 0                | 936,066           | 39               | 0              | 1,125             |
| Add subjects re-assigned by STRUCTURE           | 0                | 936,066           | 0                | 62             | 1,187             |
| Remove variants with missing call rate > 3%     | 78,222           | 857,844           | 0                | 0              | 1,187             |
| Remove variants with HWE p < 0.0001             | 1,663            | 856,181           | 0                | 0              | 1,187             |

#### Merged autosome and chrX statistics

| QC procedure                                            | Variants removed | Variants retained | Subjects removed | Subjects added | Subjects retained |
|---------------------------------------------------------|------------------|-------------------|------------------|----------------|-------------------|
| Merge autosomes and chrX                                | 0                | 880,306           | 0                | 0              | 1,187             |
| Remove subjects with IBD > 0.4, IBS > 0.9, KING > 0.177 | 0                | 880,306           | 5                | 0              | 1,182             |
| Remove subjects with missing call rate > 3%             | 0                | 880,306           | 0                | 0              | 1,182             |
| Sex discordance filter                                  | 0                | 880,306           | 0                | 0              | 1,182             |
| Excessive homozygosity filter                           | 0                | 880,306           | 0                | 0              | 1,182             |
| Duplicate variant ID filter after 1000G renaming        | 4                | 880,302           | 0                | 0              | 1,182             |
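
The bookkeeping in these tables follows one rule per row: variants retained equals the previous row's variants retained minus variants removed, and subjects retained equals the previous row's subjects retained minus subjects removed plus subjects added. The sketch below checks the merged autosome and chrX table against that rule. The numbers are copied from the table; the short step labels and variable names are my own.

```shell
#!/usr/bin/env bash
# Rows of the merged autosome and chrX table as
# step|variants removed|variants retained|subjects removed|subjects added|subjects retained
# (commas dropped so awk can do arithmetic; step labels abbreviated).
qc_rows='Merge|0|880306|0|0|1187
Relatedness|0|880306|5|0|1182
Subject call rate|0|880306|0|0|1182
Sex discordance|0|880306|0|0|1182
Excessive homozygosity|0|880306|0|0|1182
Duplicate variant ID|4|880302|0|0|1182'

# Every row after the first must agree with the row before it.
qc_check=$(printf '%s\n' "$qc_rows" | awk -F'|' '
    NR > 1 && ($3 != prev_var - $2 || $6 != prev_sub - $4 + $5) {
        printf "inconsistent row: %s\n", $1; bad = 1
    }
    { prev_var = $3; prev_sub = $6 }
    END { if (!bad) print "all rows consistent" }')
echo "$qc_check"
```

The same check can be pointed at any of the tracking tables by replacing the rows in `qc_rows`.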

### Ancestry partitioning
The data contain AA subjects as well as non-AA subjects. For now, we focus only on the AA subjects.

In [3]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/phenotype/processing

# grab the subject IDs of the AA subjects
tail -n +12 phs000428.v2.pht002614.v2.p2.c1.HRS_Subject_Phenotypes.NPR.txt | \
    awk '{ if($5==1){ print $2 } }' > aa_subject_ids.txt



#### Exclusion of subjects without phenotype data
The .fam file may contain more subject IDs than the phenotype file. The subjects without phenotype data are excluded as they provide no benefit for GWA.

In [None]:
cd /shared/data/studies/phs000428_retirement/genotype/original/processing/

mkdir aa

# Add family IDs
grep -f ../../../phenotype/processing/aa_subject_ids.txt \
    phase123.genotype-calls-matrixfmt.c1/subject_level_PLINK_sets/HRS_phase123_TOP.fam | \
    cut -d ' ' -f 1,2    \
    > ../../../phenotype/processing/aa_subject_ids.keep

# Create filtered PLINK filesets
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile  phase123.genotype-calls-matrixfmt.c1/subject_level_PLINK_sets/HRS_phase123_TOP \
        --keep ../../../phenotype/processing/aa_subject_ids.keep \
        --make-bed \
        --out aa/genotypes
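
The keep list above is built by matching subject IDs against the `.fam` file. As a sketch of the same step with exact matching on a single column, the example below compares an ID list against column 2 (IID) of a toy `.fam` file using awk. That the subject IDs correspond to the IID column is an assumption on my part, and all file names and IDs here are made up.

```shell
#!/usr/bin/env bash
# Toy inputs (hypothetical IDs): a subject ID list and a .fam file
# with columns FID IID PAT MAT SEX PHENOTYPE.
keep_dir=$(mktemp -d)
printf '%s\n' 12 345 > "$keep_dir/ids.txt"
printf '%s\n' \
    'F1 12 0 0 1 -9' \
    'F2 123 0 0 2 -9' \
    'F3 345 0 0 1 -9' > "$keep_dir/toy.fam"

# First pass stores the wanted IDs; second pass prints FID and IID only for
# rows whose IID (column 2) is exactly one of them, so 12 does not match 123.
awk 'NR == FNR { keep[$1]; next } ($2 in keep) { print $1, $2 }' \
    "$keep_dir/ids.txt" "$keep_dir/toy.fam" > "$keep_dir/toy.keep"

cat "$keep_dir/toy.keep"
```

The resulting two-column file has the FID and IID layout that PLINK `--keep` expects.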

### Map kgp IDs to rs IDs
The genotype information is given in kgp format. We need to map this to rs IDs before we update the dbSNP and genome build. Duplicates may arise and will need to be filtered based on missing call rate. The variant with the highest missing call rate will be removed. 

In [None]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/genotype/original/processing

# change conversion file to tab-separated
sed 's/,/\t/g' phase1.marker-info.MULTI/SNP_kgpID2rsID.csv >\
phase1.marker-info.MULTI/SNP_kgpID2rsID.tsv

# remove duplicate IDs 
sort -u -k 1,1 phase1.marker-info.MULTI/SNP_kgpID2rsID.tsv >\
phase1.marker-info.MULTI/SNP_kgpID2rsID.sorted


# convert kgp ID to rs ID
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile aa/genotypes \
        --update-name phase1.marker-info.MULTI/SNP_kgpID2rsID.sorted \
        --make-bed \
        --out aa/genotypes_kgpID2rsID

####  Remove duplicate rsIDs

In [None]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/genotype/original/processing

# Append _X (where X is a number) to the end of the rs IDs for all but 1st occurrence of duplicates
perl -lane 'BEGIN { %idCounts = (); }
                    if (exists($idCounts{$F[1]})) {
                        $idCounts{$F[1]}++;
                        print join("\t",$F[0],$F[1]."_".$idCounts{$F[1]},$F[2],$F[3],$F[4],$F[5]);
                    } else {
                        $idCounts{$F[1]} = 1;
                        print;
                    } ' aa/genotypes_kgpID2rsID.bim > aa/genotypes_kgpID2rsID_renamed.bim

# create a list of the duplicate snps
grep _ aa/genotypes_kgpID2rsID_renamed.bim | perl -lane \
'print substr($F[1], 0, index($F[1],"_"))."\n".$F[1];' >\
aa/genotypes_kgpID2rsID_renamed.duplicate_snps


# Get call rates for duplicate SNPs
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --bed aa/genotypes_kgpID2rsID.bed \
            --bim aa/genotypes_kgpID2rsID_renamed.bim \
            --fam aa/genotypes_kgpID2rsID.fam \
            --extract aa/genotypes_kgpID2rsID_renamed.duplicate_snps \
            --missing \
            --out aa/genotypes_kgpID2rsID.duplicate_snps.missing

# Create remove list for duplicates containing duplicate with higher missing rate
tail -n +2 aa/genotypes_kgpID2rsID.duplicate_snps.missing.lmiss |\
 perl -lane 'BEGIN { %missingness = (); }
                        if ($F[1] =~ /^(\S+)_/) {
                            $duplicateName = $1
                        } else {
                            $duplicateName = $F[1]."_2";
                        }
                        if (exists($missingness{$duplicateName})) {
                            if ($missingness{$duplicateName} > $F[4]) {
                                print $duplicateName;
                            } else {
                                print $F[1];
                            }
                        } else {
                            $missingness{$F[1]} = $F[4];
                        }' > aa/genotypes_kgpID2rsID.duplicate_snps.remove

# Remove duplicates with higher missing rate
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
   --bed /shared/data/studies/phs000428_retirement/genotype/original/processing/aa/genotypes_kgpID2rsID.bed \
   --bim /shared/data/studies/phs000428_retirement/genotype/original/processing/aa/genotypes_kgpID2rsID_renamed.bim \
   --fam /shared/data/studies/phs000428_retirement/genotype/original/processing/aa/genotypes_kgpID2rsID.fam \
   --exclude /shared/data/studies/phs000428_retirement/genotype/original/processing/aa/genotypes_kgpID2rsID.duplicate_snps.remove \
   --make-bed \
   --out /shared/data/studies/phs000428_retirement/genotype/original/processing/aa/genotypes_kgpID2rsID_duplicates_removed

# Remove "_2" from rs IDs
perl -i.bak -lne 's/_2//; print;'  aa/genotypes_kgpID2rsID_duplicates_removed.bim
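
The selection rule used above (drop whichever copy of a duplicated rsID has the higher missing rate) can be illustrated on a toy `.lmiss` file. This is a simplified awk sketch, not the Perl one-liner from the cell above. It assumes the PLINK 1.9 `.lmiss` column order (CHR, SNP, N_MISS, N_GENO, F_MISS), handles pairs only, and drops the suffixed copy on ties. The SNP names and values are made up.

```shell
#!/usr/bin/env bash
lmiss_dir=$(mktemp -d)
cat > "$lmiss_dir/toy.lmiss" <<'EOF'
 CHR      SNP   N_MISS   N_GENO   F_MISS
   1    rs100        5      100     0.05
   1  rs100_2        1      100     0.01
   2    rs200        0      100     0.00
   2  rs200_2        3      100     0.03
EOF

awk 'NR > 1 {
        base = $2; sub(/_[0-9]+$/, "", base)      # strip the _N suffix, if any
        if (base in miss) {
            # second copy seen: report whichever copy has the higher F_MISS
            if (miss[base] + 0 > $5 + 0) print name[base]; else print $2
        } else {
            miss[base] = $5; name[base] = $2
        }
    }' "$lmiss_dir/toy.lmiss" > "$lmiss_dir/toy.remove"

cat "$lmiss_dir/toy.remove"
```

In the toy data, `rs100` is listed for removal because its unsuffixed copy has the higher missing rate, while for `rs200` the suffixed copy `rs200_2` is listed.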
 


### Update dbSNP and genome build

To ensure that all of the population controls have variant and genomic data in dbSNP 138 and genome build 37 format, I use ID and position mappers to make the updates.

In [None]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/genotype/original/processing

# update name to dbSNP138
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile aa/genotypes_kgpID2rsID_duplicates_removed \
        --update-name /shared/bioinformatics/methods/nas03/bioinformatics_group/common/snp_id_conversion/b138/old_to_current.xref \
        --make-bed \
        --out aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update


# There are still some duplicates that need to be taken care of again.
# Append _X (where X is a number) to the end of the rs IDs for all but 1st occurrence of duplicates
perl -lane 'BEGIN { %idCounts = (); }
                    if (exists($idCounts{$F[1]})) {
                        $idCounts{$F[1]}++;
                        print join("\t",$F[0],$F[1]."_".$idCounts{$F[1]},$F[2],$F[3],$F[4],$F[5]);
                    } else {
                        $idCounts{$F[1]} = 1;
                        print;
                    } ' aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update.bim > aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_X.bim

# create a list of the duplicate snps
grep _ aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_X.bim | perl -lane \
'print substr($F[1], 0, index($F[1],"_"))."\n".$F[1];' >\
aa/genotypes_kgpID2rsID_renamed.duplicate_snps_X

# keep only the IDs that carry the _X suffix; these are the duplicates to remove
grep '_' aa/genotypes_kgpID2rsID_renamed.duplicate_snps_X > aa/genotypes_kgpID2rsID_renamed.duplicate_snps_X.remove

# Remove duplicates with _ in name
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
   --bed aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update.bed \
   --bim aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_X.bim \
   --fam aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update.fam \
   --exclude aa/genotypes_kgpID2rsID_renamed.duplicate_snps_X.remove \
   --make-bed \
   --out aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_clean



# Update variant chr 
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_clean \
        --update-chr /shared/bioinformatics/methods/nas03/bioinformatics_group/common/\
build_conversion/b37/dbsnp_b138/uniquely_mapped_snps.chromosomes \
        --make-bed \
        --out aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_clean_chr_b37
# Note:  688235 values updated, 60400964 variant IDs not present.


# Update variant chr coordinate
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_clean_chr_b37 \
        --update-map /shared/bioinformatics/methods/nas03/bioinformatics_group/common/\
build_conversion/b37/dbsnp_b138/uniquely_mapped_snps.positions \
        --make-bed \
        --out aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_clean_chr_position_b37

# Filter to only build 37 uniquely mapped variants
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile aa/genotypes_kgpID2rsID_duplicates_removed_rsid_update_clean_chr_position_b37 \
        --extract /shared/bioinformatics/methods/nas03/bioinformatics_group/common/\
build_conversion/b37/dbsnp_b138/uniquely_mapped_snps.ids \
        --make-bed \
        --out aa/genotypes_b37_dbsnp138

### Intermittent upload to S3
As a precaution, we should upload our results to S3 periodically for safekeeping.

In [None]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/genotype/original/processing/
aws s3 cp aa/ s3://rti-common/dbGaP/phs000428_retirement/genotype/original/processing/ --recursive

### Partition into autosome and chrX groups
I apply QC to autosomes and chrX separately, so separate subdirectories are created for the processing of each set.

In [None]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/genotype/original/processing/aa

mkdir autosomes chrX

# Autosomes
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile genotypes_b37_dbsnp138 \
    --autosome \
    --make-bed \
    --out autosomes/genotypes_b37_dbsnp138

# ChrX (include split PARs)
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile genotypes_b37_dbsnp138 \
    --chr 23,25 \
    --make-bed \
    --out chrX/genotypes_b37_dbsnp138_unmerged

# Combine split chrX and PARs
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile chrX/genotypes_b37_dbsnp138_unmerged \
    --merge-x \
    --make-bed \
    --out chrX/genotypes_b37_dbsnp138
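
As a sanity check around the partitioning step, the variants in a `.bim` file can be tallied by chromosome group. PLINK codes the autosomes as 1 to 22, X as 23, Y as 24, the pseudo-autosomal region (XY) as 25, and MT as 26. The sketch below counts a toy `.bim` file under the assumption that chromosomes are written as numeric codes; the variants are made up.

```shell
#!/usr/bin/env bash
bim_dir=$(mktemp -d)
# Toy .bim rows: chr, variant ID, cM, bp, allele 1, allele 2 (made-up variants)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1  rsA 0 1000 A G \
    22 rsB 0 2000 C T \
    23 rsC 0 3000 A C \
    25 rsD 0 4000 G T \
    26 rsE 0 5000 A G > "$bim_dir/toy.bim"

# Group by chromosome code: 1-22 autosomes; 23 and 25 together form the
# chrX set used above (X plus split PARs); everything else is "other".
chr_tally=$(awk '
    $1 >= 1 && $1 <= 22 { n["autosome"]++; next }
    $1 == 23 || $1 == 25 { n["chrX_with_PAR"]++; next }
    { n["other"]++ }
    END { printf "autosome=%d chrX_with_PAR=%d other=%d\n",
                 n["autosome"], n["chrX_with_PAR"], n["other"] }' "$bim_dir/toy.bim")
echo "$chr_tally"
```

Comparing the tallies for the input fileset with the line counts of the two partitioned `.bim` files shows whether any variants were lost or double-counted.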

### Missing autosome data subject filtering
We calculate the proportion of missing genotype calls per chromosome using PLINK to assess whether any subjects have data missing for whole autosomes.

In [None]:
cd /shared/data/studies/phs000428_retirement/genotype/original/processing/aa

for chr in {1..22}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name aa_${chr} \
        --script_prefix autosomes/chr${chr}_missing_call_rate \
        --mem 3.8 \
        --priority 0 \
        --program  /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
            --noweb \
            --bfile autosomes/genotypes_b37_dbsnp138 \
            --missing \
            --chr $chr \
            --out autosomes/chr${chr}_missing_call_rate
done

for chr in {1..22}; do
    tail -n +2 autosomes/chr${chr}_missing_call_rate.imiss | \
        awk '{ OFS="\t" } { if($6==1){ print $1,$2 } }' >> autosomes/missing_whole_autosome.remove
done

For this case none of the subjects had missing autosome data. If subjects ever show up as having missing autosome data then further discussions need to be had on whether these subjects should be removed completely or whether they should only be excluded for the missing chromosomes.
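
The filter in the loop above flags a subject when F_MISS (column 6 of the `.imiss` file) equals 1 for a chromosome. The sketch below shows that rule on a toy `.imiss` file with the PLINK column order FID, IID, MISS_PHENO, N_MISS, N_GENO, F_MISS. The subjects and counts are made up.

```shell
#!/usr/bin/env bash
imiss_dir=$(mktemp -d)
cat > "$imiss_dir/chr21_toy.imiss" <<'EOF'
 FID   IID MISS_PHENO   N_MISS   N_GENO   F_MISS
  F1    S1          Y       12     9000 0.001333
  F2    S2          Y     9000     9000        1
EOF

# Skip the header, then print FID and IID (tab-separated) for subjects whose
# missing rate on this chromosome is exactly 1, i.e. no genotype calls at all.
tail -n +2 "$imiss_dir/chr21_toy.imiss" |
    awk 'BEGIN { OFS = "\t" } $6 == 1 { print $1, $2 }' > "$imiss_dir/toy.remove"

cat "$imiss_dir/toy.remove"
```

Only subject `S2`, who has no calls on the toy chromosome, ends up in the remove list.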

In [None]:
# Clean up 
rm autosomes/chr*missing_call_rate*
rm autosomes/missing_whole_autosome.remove

### Remove duplicate SNPs
If an rsID is present more than once, the variant with the better genotype call rate across subjects should be retained. The call rates across subjects can be calculated using PLINK `--missing`.

In [None]:
cd /shared/data/studies/phs000428_retirement/genotype/original/processing/aa

# Find duplicate rsIDs
cut -f2 autosomes/genotypes_b37_dbsnp138.bim | sort | uniq -D > autosomes/variant_duplicates.txt

For this case there are no duplicated rsIDs.
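
The same duplicate check can be sketched with awk, which also reports how many times each ID occurs. The example runs on a toy `.bim` file with made-up variants; it is an alternative illustration, not the command used above.

```shell
#!/usr/bin/env bash
dup_dir=$(mktemp -d)
# Toy .bim rows: chr, variant ID, cM, bp, allele 1, allele 2
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 rs1 0 100 A G \
    1 rs2 0 200 C T \
    1 rs1 0 300 A C > "$dup_dir/toy.bim"

# Count how often each variant ID (column 2) occurs, then report the IDs seen
# more than once together with their counts.
awk '{ count[$2]++ } END { for (id in count) if (count[id] > 1) print id, count[id] }' \
    "$dup_dir/toy.bim" | sort > "$dup_dir/toy_duplicates.txt"

cat "$dup_dir/toy_duplicates.txt"
```

An empty output file means no variant ID is duplicated.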

### Recoding variants for 1000G phase 3
Reference SNP IDs (rsIDs) for variants can vary depending on the dbSNP build used, and not all variant IDs use rsID nomenclature. To provide a common nomenclature that will make comparisons across data sets feasible, I use a script that recodes all variant names to match 1000G phase 3 variants by position and alleles. The 1000G Phase 3 data I used for STRUCTURE are from /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink on MIDAS, but from correspondence with Nathan Gaddis I learned that /share/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/ also contains 1000G Phase 3 data derived from the May 2013 release. The difference is that it was downloaded from the IMPUTE2 website and reformatted to be directly compatible with IMPUTE2.

The data in /share/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/ are used for variant name recoding, but the 1000G genotype information is acquired from /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink.

#### Filtered study data file renaming

In [None]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/genotype/original/processing

mkdir 1000g_name_recoding

ancestry="aa"
for ext in {bed,bim,fam}; do
    cp  ${ancestry}/genotypes_b37_dbsnp138.${ext} 1000g_name_recoding/${ancestry}_chr_all.${ext}
done

#### Variant ID updating
Because the 1000G data and the study data have rsIDs from different dbSNP builds, I standardize them using convert_to_1000g_ids.v4.pl. In the study data set, certain indels may be represented as two variants, a monomorphic variant and an indel with the - symbol for one of the alleles. For example:
```
1   rs201826967  0.809   57873968   0   G
1   rs11284630   0.809   57873969   -   A
```

These two variants represent a G:GA indel and are coded as such in the 1000 Genomes data. The script that updates the names to 1000 Genomes IMPUTE2 format will assign the same ID to these two variants. The duplicated IDs will cause problems for PLINK filtering, so from each set of duplicate IDs I will remove the variant that has the lower genotype call rate. Duplicates may arise for other reasons, and they will be filtered based on the same criterion.

In [None]:
## EC2 ##
cd /shared/data/studies/phs000428_retirement/genotype/original/processing/1000g_name_recoding/
mkdir 1000g_data

# Break out data by chr
ancestry="aa"
for chr in {1..23}; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --memory 2048 \
        --bfile ${ancestry}_chr_all \
        --chr ${chr} \
        --make-bed \
        --out ${ancestry}_chr${chr}
done


# /shared/bioinformatics/methods/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/

# Rename study autosome variant IDs
ancestry="aa"
for chr in {1..22}; do
    /shared/bioinformatics/software/scripts/qsub_job.sh \
        --job_name recode_to_1000g_${chr} \
        --script_prefix ${ancestry}_chr${chr}_id_rename \
        --mem 6 \
        --priority 0 \
        --program <path to>/convert_to_1000g_ids.v4.pl \
        --file_in ${ancestry}_chr${chr}.bim \
        --file_out ${ancestry}_chr${chr}_renamed.bim \
        --legend /shared/bioinformatics/methods/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.gz \
        --file_in_header 0 \
        --file_in_id_col 1 \
        --file_in_chr_col 0 \
        --file_in_pos_col 3 \
        --file_in_a1_col 4 \
        --file_in_a2_col 5 \
        --chr ${chr}
done