# Geneva prostate cancer (AA subjects only)
**Author:** Jesse Marks
This notebook contains the data processing steps (QC pipeline) for the GENEVA prostate cancer study [phs000306.v4.p1](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000306.v4.p1). The study contains multiple consent groups, of which we only have access to c2 and c5. Both are downloaded and then combined into a single directory structure. **Note:** These steps have already been carried out by Bryan Quach in the bioinformatics [GitHub Repo](https://github.com/RTIInternational/bquach_notebooks/blob/master/heroin_project/develop/20170928_oa_gwas_data_processing.ipynb).


### Creating directory structure on MIDAS
Preprocessing the data includes untarring, gunzipping, and renaming files.

In [None]:
# local #
ssh jmarks@rtplhpc01.rti.ns

# MIDAS #
cd /share/nas04/bioinformatics_group/data/amazon_s3/studies
cp -r geneva_prostate/ /share/nas03/jmarks/QC-test/
cd /share/nas03/jmarks/QC-test/geneva_prostate/

# Going to remove the processed data that Bryan did and process everything myself for a sanity check.
rm -rf phenotype/processing/*
rm -rf genotype/original/processing/*
cp -r phenotype/unprocessed/* phenotype/processing/
cp -r genotype/original/unprocessed/* genotype/original/processing/

# Untar and rename genotype data
for i in genotype/original/processing/c2/*.tar; do tar -xvf "$i" -C "$(dirname "$i")"; done
for i in genotype/original/processing/c5/*.tar; do tar -xvf "$i" -C "$(dirname "$i")"; done
rm genotype/original/processing/c2/*.tar
rm genotype/original/processing/c5/*.tar
gunzip -r */
# Strip the dbGaP prefix (phg00...Cancer_) from the extracted directory names
for i in genotype/original/processing/c2/*; do mv "$i" "$(echo "$i" | perl -pe 's/phg00.+Cancer_//g')"; done
for i in genotype/original/processing/c5/*; do mv "$i" "$(echo "$i" | perl -pe 's/phg00.+Cancer_//g')"; done
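
As a quick check of the renaming pattern, the sketch below applies the same substitution to a made-up directory name inside a scratch directory. The `phg000123.v1.GENEVA_MEC_ProstateCancer_` prefix is hypothetical (the real dbGaP prefixes may differ), and `sed -E` stands in for the `perl` one-liner.

```shell
# Scratch directory so nothing in the study tree is touched
workdir=$(mktemp -d)
# Hypothetical dbGaP-style name; only the "phg00...Cancer_" prefix matters here
mkdir "$workdir/phg000123.v1.GENEVA_MEC_ProstateCancer_AA.genotype-calls-matrixfmt.c2"

for i in "$workdir"/phg00*; do
    # Strip everything from "phg00" through "Cancer_" in the base name
    new_name=$(basename "$i" | sed -E 's/phg00.+Cancer_//')
    mv "$i" "$workdir/$new_name"
done

ls "$workdir"
```

Working on the base name rather than the full path keeps the greedy `.+` from ever touching parent directories.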

## Genotype Processing (AA subjects only)

### Quality Control Sample Tracking
The table below provides statistics on variants and subjects filtered during each step of the QC process.

#### Pre-chromosome type partitioning
**Note:** This data set contains chrY and chrM variants that will be excluded after reducing to autosomes and chrX. Initial subject counts are determined by considering only subjects with both genotype and phenotype data available. Consequently, initial numbers will not match between the .fam files and the phenotype files (group c5 is completely missing phenotype data).

### Exclusion of subjects without phenotype data
The .fam file contains more subject IDs than the phenotype file. Subjects without phenotype data are excluded because they contribute nothing to the GWAS.

In [None]:
# MIDAS #
# command line 

# Get subject IDs from phenotype data
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing
mkdir aa
tail -n +12 ../../../phenotype/processing/c2/Subject.MULTI.txt | \
    cut -f 2 | \
    sort > ../../../phenotype/processing/c2/subject_ids.txt

# Add family IDs
grep -f ../../../phenotype/processing/c2/subject_ids.txt c2/AA.genotype-calls-matrixfmt.c2/subject_level_filtered_PLINK_sets/GENEVA_MEC_ProstateCancer_AA_FORWARD_subject_level_c2.fam | \
    cut -d ' ' -f 1,2 \
    > ../../../phenotype/processing/c2/subject_ids.keep

# Create filtered PLINK fileset 
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --memory 2048 \
    --bfile c2/AA.genotype-calls-matrixfmt.c2/subject_level_filtered_PLINK_sets/GENEVA_MEC_ProstateCancer_AA_FORWARD_subject_level_c2 \
    --keep ../../../phenotype/processing/c2/subject_ids.keep \
    --make-bed \
    --out aa/genotypes
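
One caveat with `grep -f` is that it matches substrings, so an ID such as `101` would also pull in `1010`. The sketch below, run on made-up IDs, uses `awk` to match the second `.fam` column exactly. The file names and IDs are illustrative only, and this is an alternative to the command above rather than what was actually run.

```shell
workdir=$(mktemp -d)
# Toy phenotype ID list (one subject ID per line)
printf '101\n103\n' > "$workdir/subject_ids.txt"
# Toy .fam file: FID IID PAT MAT SEX PHENO (space-delimited)
printf 'F1 101 0 0 1 -9\nF2 1010 0 0 1 -9\nF3 103 0 0 1 -9\n' > "$workdir/toy.fam"

# First pass loads the IDs; second pass prints FID and IID for exact IID matches
awk 'NR==FNR { ids[$1]; next } ($2 in ids) { print $1, $2 }' \
    "$workdir/subject_ids.txt" "$workdir/toy.fam" > "$workdir/subject_ids.keep"

cat "$workdir/subject_ids.keep"
```

Subject `1010` is left out here, whereas a plain `grep -f` on the same files would have kept it.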


## Update dbSNP and genome build
To ensure that all of the population controls have variant and genomic data in dbSNP 138 and genome build 37 format, use ID and position mappers to make the updates.

In [None]:
# command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing

# Update variant chr
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --memory 2048 \
    --bfile aa/genotypes \
    --update-chr /share/nas03/bioinformatics_group/common/build_conversion/b37/dbsnp_b138/uniquely_mapped_snps.chromosomes \
    --make-bed \
    --out aa/genotypes_chr_b37

# Update variant chr coordinate
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --memory 2048 \
    --bfile aa/genotypes_chr_b37 \
    --update-map /share/nas03/bioinformatics_group/common/build_conversion/b37/dbsnp_b138/uniquely_mapped_snps.positions \
    --make-bed \
    --out aa/genotypes_chr_position_b37

# Filter to only build 37 uniquely mapped variants
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --memory 2048 \
    --bfile aa/genotypes_chr_position_b37 \
    --extract /share/nas03/bioinformatics_group/common/build_conversion/b37/dbsnp_b138/uniquely_mapped_snps.ids \
    --make-bed \
    --out aa/genotypes_b37_dbsnp138



## Partition into autosomes and chrX groups
Apply QC to autosomes and chrX separately, so separate subdirectories are created for the processing of each set.

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/aa/
mkdir autosomes chrX

# Autosomes (this command excludes all unplaced and non-autosomal variants)
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --bfile genotypes_b37_dbsnp138 \
    --autosome \
    --make-bed \
    --out autosomes/genotypes_b37_dbsnp138


# ChrX (include split PARs)
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --bfile genotypes_b37_dbsnp138 \
    --chr 23,25 \
    --make-bed \
    --out chrX/genotypes_b37_dbsnp138_unmerged

# Combine split chrX and PARs
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --bfile chrX/genotypes_b37_dbsnp138_unmerged \
    --merge-x \
    --make-bed \
    --out chrX/genotypes_b37_dbsnp138


## Missing autosome data subject filtering
Calculate the proportion of missing genotype calls per chromosome using PLINK to assess whether any subjects have data missing for whole autosomes.

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/aa

# Get missing call rate per chr
for chr in {1..22}; do
    /share/nas03/bioinformatics_group/software/scripts/qsub_job.sh \
    --job_name aa_${chr} \
    --script_prefix autosomes/chr${chr}_missing_call_rate \
    --mem 3.8 \
    --priority 0 \
    --program /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
        --noweb \
        --bfile autosomes/genotypes_b37_dbsnp138 \
        --missing \
        --chr $chr \
        --out autosomes/chr${chr}_missing_call_rate
done

# Find subjects that have data missing for whole autosomes
for chr in {1..22}; do
    tail -n +2 autosomes/chr${chr}_missing_call_rate.imiss | \
        awk '{ OFS="\t" } { if($6==1){ print $1,$2 } }' >> autosomes/missing_whole_autosome.remove
done

In this case, no subjects were missing data for a whole autosome. If any subjects do turn up with a missing autosome, further discussion is needed on whether to remove them completely or to exclude them only for the affected chromosomes.
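
To see what a hit would look like, the sketch below runs the same filter on a made-up `.imiss` file. The subject IDs and counts are invented; the column layout (FID, IID, MISS_PHENO, N_MISS, N_GENO, F_MISS) follows the PLINK `--missing` output that the filter above relies on.

```shell
workdir=$(mktemp -d)
# Toy .imiss file: FID IID MISS_PHENO N_MISS N_GENO F_MISS
cat > "$workdir/chr1_missing_call_rate.imiss" <<'EOF'
 FID IID MISS_PHENO N_MISS N_GENO F_MISS
 F1 S1 N 12 5000 0.0024
 F2 S2 N 5000 5000 1
EOF

# Skip the header, then keep FID/IID where the missing rate (column 6) is exactly 1
tail -n +2 "$workdir/chr1_missing_call_rate.imiss" | \
    awk 'BEGIN { OFS="\t" } $6 == 1 { print $1, $2 }' > "$workdir/missing_whole_autosome.remove"

cat "$workdir/missing_whole_autosome.remove"
```

Only subject `S2`, with every genotype on the chromosome missing, ends up in the removal list.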

In [None]:
# clean up autosome directory
rm autosomes/chr*missing_call_rate*

## Remove duplicate SNPs
If the same rsID is present more than once, the copy with the better genotype call rate across subjects should be retained. The call rates across subjects can be calculated with PLINK `--missing`.

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/aa

# Find duplicate rsIDs
cut -f2,2 autosomes/genotypes_b37_dbsnp138.bim | sort | uniq  -D > autosomes/variant_duplicates.txt

Note: no duplicated rsIDs found for this case 
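
For reference, the sketch below shows what the duplicate check reports on a made-up `.bim` file in which `rs1` appears twice. It uses `uniq -d`, which lists each repeated ID once, instead of the `-D` option used above, which prints every occurrence; the variant IDs and positions are invented.

```shell
workdir=$(mktemp -d)
# Toy .bim file: chr, variant ID, cM, bp, allele 1, allele 2 (tab-delimited)
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs1\t0\t300\tA\tG\n' > "$workdir/toy.bim"

# Print each variant ID that occurs more than once, one line per ID
cut -f2 "$workdir/toy.bim" | sort | uniq -d > "$workdir/variant_duplicates.txt"

cat "$workdir/variant_duplicates.txt"
```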

## DNA strand flipping
To determine if strand flipping is an issue, I performed a merge between chr1 of the study data and 1000 Genomes Phase 3 data. Doing this will produce a log of problematic variants that may be attributable to strand flipping. These results can be compared to chr1 data with flipped variants to see if strand orientation is truly the issue.

As of 11/17/2017 the binary filesets per chromosome in PLINK format for 1000G Phase 3 can be found in `/share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink`.

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/aa

# Extract chr1 unflipped variants
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --bfile autosomes/genotypes_b37_dbsnp138 \
    --chr 1 \
    --make-bed \
    --out chr1_unflipped

# Attempt merge with 1000G chr1 data
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --bfile chr1_unflipped \
    --bmerge /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink/ALL.chr1 \
    --make-bed \
    --out chr1_unflipped_test


# Extract chr1 flipped variants
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --bfile autosomes/genotypes_b37_dbsnp138 \
    --chr 1 \
    --flip chr1_unflipped_test-merge.missnp \
    --make-bed \
    --out chr1_flipped

# Attempt merge with 1000G chr1 data
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --bfile chr1_flipped \
    --bmerge /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink/ALL.chr1 \
    --make-bed \
    --out chr1_flipped_test

# Clean up
rm chr1_*

Of the two merges, the flipped merge produced far fewer errors. I will therefore run a merge for each chromosome to produce a flip list and then flip those variants.

In [1]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/aa/

# Get flip list 
for chr in {1..22}; do
    # Extract unflipped variants
    /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
        --noweb \
        --bfile autosomes/genotypes_b37_dbsnp138 \
        --chr ${chr} \
        --make-bed \
        --out chr${chr}_unflipped

    # Attempt merge with 1000G data
    /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
        --noweb \
        --bfile chr${chr}_unflipped \
        --bmerge /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink/ALL.chr${chr} \
        --make-bed \
        --out chr${chr}_unflipped_test

    # Extract flipped variants
    /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
        --noweb \
        --bfile autosomes/genotypes_b37_dbsnp138 \
        --chr ${chr} \
        --flip chr${chr}_unflipped_test-merge.missnp \
        --make-bed \
        --out chr${chr}_flipped

    # Attempt merge with 1000G data
    /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
        --noweb \
        --bfile chr${chr}_flipped \
        --bmerge /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink/ALL.chr${chr} \
        --make-bed \
        --out chr${chr}_flipped_test
done


# Combine flip lists
cat chr*_unflipped_test-merge.missnp | sort -u > chr_all_unflipped_test-merge.missnp
cat chr*_flipped_test-merge.missnp | sort -u > chr_all_flipped_test-merge.missnp
comm -23 chr_all_unflipped_test-merge.missnp chr_all_flipped_test-merge.missnp \
    > autosomes/chr_all.flip

# Perform final flip
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --bfile autosomes/genotypes_b37_dbsnp138 \
    --flip autosomes/chr_all.flip \
    --make-bed \
    --out autosomes/genotypes_b37_dbsnp138_flipped

# Clean up
rm chr*flipped*
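
The set difference that builds the flip list can be checked on made-up merge-error lists. In the sketch below the file names and rsIDs are invented: `rs2` fails the merge both before and after flipping, so only `rs1` and `rs3` should end up in the flip list.

```shell
workdir=$(mktemp -d)
# Toy merge-error lists: rs2 fails before and after flipping, rs1 and rs3 only before
printf 'rs1\nrs2\nrs3\n' | sort -u > "$workdir/unflipped.missnp"
printf 'rs2\n' | sort -u > "$workdir/flipped.missnp"

# comm -23 keeps lines unique to the first (sorted) file:
# variants whose merge error disappears once the strand is flipped
comm -23 "$workdir/unflipped.missnp" "$workdir/flipped.missnp" > "$workdir/chr_all.flip"

cat "$workdir/chr_all.flip"
```

`comm` requires both inputs to be sorted the same way, which is why the real lists are passed through `sort -u` first.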


## Detecting ancestral outliers with STRUCTURE
[STRUCTURE](https://web.stanford.edu/group/pritchardlab/structure.html) is a software tool that can be used to identify admixed individuals, among other uses. By comparing the study subjects with the 1000 Genomes Phase 3 reference panel, I can estimate the composition of an individual's ancestry and identify discrepancies between self-reported ancestry and genetic information. For the study data, I will compare the individuals to three superpopulations from the 1000 Genomes Phase 3 reference panel:

* AFR (African)
* EAS (East Asian)
* EUR (European)

#### SNP subset selection
For computational efficiency, 10,000 SNPs are randomly chosen from the intersection of SNPs in the study data with the three 1000 Genomes superpopulations of interest. As of 11/6/2017 the binary filesets per chromosome in PLINK format for 1000G Phase 3 can be found in `/share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink`. Although I am unsure of the dbSNP build for the rsIDs, any discrepancies should not be an issue because in the end only a subset of the SNPs is used as input to STRUCTURE.

**NOTE:** From correspondence with Nathan Gaddis I learned that `/share/nas03/bioinformatics_group/data/ref_panels/1000G/2014.10/` also contains 1000G Phase 3 data derived from the May 2013 release. The difference is that it was downloaded from the [IMPUTE2 website](https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html) and reformatted to be directly compatible with IMPUTE2.

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/
mkdir structure

# Get lists of non-A/T and non-C/G SNPs
ancestry="aa"
perl -lane 'if (($F[4] eq "A" && $F[5] ne "T") || ($F[4] eq "T" && $F[5] ne "A") || ($F[4] eq "C" && $F[5] ne "G") || ($F[4] eq "G" && $F[5] ne "C")) { print $F[1]; }' \
${ancestry}/autosomes/genotypes_b37_dbsnp138_flipped.bim | \
    sort -u | \
    grep "rs" \
    > structure/${ancestry}_no_at_cg_snps.txt

# Get list of variants from 1000G
mkdir structure/1000g_data
/share/nas03/bioinformatics_group/software/scripts/qsub_job.sh \
    --job_name merge_1000g_snps \
    --script_prefix structure/1000g_data/merge_1000g_snps \
    --mem 3 \
    --priority 0 \
    --program "cat /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink/ALL.chr{1..22}.bim | \
        cut -f2,2 | \
        sort -u | \
        grep \"rs\" > structure/1000g_data/1000g_phase3_snps.txt"
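
The step that draws the random SNP subset is not shown in this notebook, so the sketch below is only an assumption about how a file like `structure/10k_snp_random_sample.txt` could be produced from the two SNP lists. It uses made-up rsIDs, a sample size of 3 instead of 10,000, and hypothetical file names; the `awk` random-key approach is one option among several.

```shell
workdir=$(mktemp -d)
# Toy sorted ID lists standing in for the study and 1000G SNP lists
printf 'rs1\nrs2\nrs3\nrs4\nrs5\n' | sort -u > "$workdir/study_snps.txt"
printf 'rs2\nrs3\nrs4\nrs5\nrs6\n' | sort -u > "$workdir/1000g_snps.txt"

# SNPs present in both lists (comm needs sorted input)
comm -12 "$workdir/study_snps.txt" "$workdir/1000g_snps.txt" > "$workdir/shared_snps.txt"

# Random subset: tag each ID with a random key, sort on it, keep the first n
n=3   # 10000 in the real analysis
awk 'BEGIN { srand(42) } { printf "%.10f\t%s\n", rand(), $0 }' "$workdir/shared_snps.txt" | \
    sort -k1,1n | head -n "$n" | cut -f2 > "$workdir/snp_random_sample.txt"

cat "$workdir/snp_random_sample.txt"
```

Fixing the seed makes the draw repeatable on a given system, which helps when the subset has to be regenerated after a failed merge check.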

### Extract SNP subset PLINK binary filesets for 1000G data
The 05/2013 release of the 1000 Genomes data has been previously processed and converted to PLINK binary fileset format, but the files include all the 1000G individuals. We are interested in only three superpopulations, so we create filesets specifically for each of these groups. It was brought to my attention that 1000G Phase 3 rsIDs may be duplicated across chromosomes, potentially causing chromosome merging issues. For that reason, it is typically recommended to process the chromosomes separately at first, then combine them after SNP subsetting. Subject IDs with superpopulations are available at
`/share/nas03/bioinformatics_group/data/ref_panels/1000G/igsr_samples.tsv`

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/

# Ancestry specific directories
for pop in {AFR,EAS,EUR}; do
    mkdir structure/1000g_data/${pop}
done

# Get subject IDs by ancestry
awk 'BEGIN { FS="\t"; OFS="\t" } { if($7=="African"){print $1,$1} }' /share/nas03/bioinformatics_group/data/ref_panels/1000G/igsr_samples.tsv \
    > structure/1000g_data/AFR/AFR_subject_ids.txt
awk 'BEGIN { FS="\t"; OFS="\t" } { if($7=="East Asian"){print $1,$1} }' /share/nas03/bioinformatics_group/data/ref_panels/1000G/igsr_samples.tsv \
    > structure/1000g_data/EAS/EAS_subject_ids.txt
awk 'BEGIN { FS="\t"; OFS="\t" } { if($7=="European"){print $1,$1} }' /share/nas03/bioinformatics_group/data/ref_panels/1000G/igsr_samples.tsv \
    > structure/1000g_data/EUR/EUR_subject_ids.txt

# Make new binary filesets for each 1000G group
# Memory request is high because MIDAS fails on this step if whole node not occupied by a job
for pop in {AFR,EAS,EUR}; do
    for chr in {1..22}; do
        /share/nas03/bioinformatics_group/software/scripts/qsub_job.sh \
            --job_name ${pop}_${chr}_filter \
            --script_prefix structure/1000g_data/${pop}/ancestry_partition_chr${chr} \
            --mem 15.5 \
            --priority 0 \
            --program /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
                --noweb \
                --memory 10000 \
                --bfile /share/nas03/bioinformatics_group/data/ref_panels/1000G/2013.05/plink/ALL.chr${chr} \
                --keep structure/1000g_data/${pop}/${pop}_subject_ids.txt \
                --make-bed \
                --out structure/1000g_data/${pop}/${pop}.chr${chr}
    done
done



# Apply SNP subset extraction by chr
for pop in {AFR,EAS,EUR}; do
    for chr in {1..22}; do
        /share/nas03/bioinformatics_group/software/scripts/qsub_job.sh \
            --job_name ${pop}_${chr}_subsample \
            --script_prefix structure/1000g_data/${pop}/ancestry_partition_chr${chr} \
            --mem 7.6 \
            --priority 0 \
            --program /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
                --noweb \
                --memory 7500 \
                --bfile structure/1000g_data/${pop}/${pop}.chr${chr} \
                --extract structure/10k_snp_random_sample.txt \
                --make-bed \
                --out structure/1000g_data/${pop}/${pop}_chr${chr}_10k_snp_random_sample
    done
done

# Create merge lists and merge autosomes for each 1000G population
data_dir=structure/1000g_data
for pop in {AFR,EAS,EUR}; do
    echo "${data_dir}/${pop}/${pop}_chr1_10k_snp_random_sample" > ${data_dir}/${pop}/${pop}_autosome_merge_list.txt
    for chr in {2..22}; do
        echo "${data_dir}/${pop}/${pop}_chr${chr}_10k_snp_random_sample" \
        >> ${data_dir}/${pop}/${pop}_autosome_merge_list.txt
    done
done

for pop in {AFR,EAS,EUR}; do
    /share/nas03/bioinformatics_group/software/scripts/qsub_job.sh \
        --job_name ${pop}_merge_plink_filesets \
        --script_prefix structure/1000g_data/${pop}/merge_plink_filesets \
        --mem 4 \
        --priority 0 \
        --program /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
            --noweb \
            --memory 4000 \
            --merge-list structure/1000g_data/${pop}/${pop}_autosome_merge_list.txt \
            --snps-only just-acgt \
            --make-bed \
            --out structure/1000g_data/${pop}/${pop}_all_autosomes_10k_snp_random_sample
done
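
The per-superpopulation subject lists can be reproduced on a made-up sample sheet. In the sketch below the rows are invented and only columns 1 (sample ID) and 7 (superpopulation name) are assumed to match the layout used above; the loop over code and label pairs is a compact alternative to the three separate `awk` calls.

```shell
workdir=$(mktemp -d)
# Toy sample sheet; only column 1 (sample ID) and column 7 (superpopulation name) matter
printf 'S1\tmale\tB1\tYRI\tYoruba\tAFR\tAfrican\n' > "$workdir/toy_samples.tsv"
printf 'S2\tfemale\tB2\tCHB\tHan Chinese\tEAS\tEast Asian\n' >> "$workdir/toy_samples.tsv"
printf 'S3\tmale\tB3\tCEU\tCEPH\tEUR\tEuropean\n' >> "$workdir/toy_samples.tsv"

# One keep file per superpopulation, with the sample ID used as both FID and IID
for pair in "AFR:African" "EAS:East Asian" "EUR:European"; do
    pop=${pair%%:*}      # short code used in file names
    label=${pair#*:}     # value expected in column 7
    awk -v label="$label" 'BEGIN { FS="\t"; OFS="\t" } $7 == label { print $1, $1 }' \
        "$workdir/toy_samples.tsv" > "$workdir/${pop}_subject_ids.txt"
done

cat "$workdir/EAS_subject_ids.txt"
```

Setting `FS` to a tab matters here because labels such as "East Asian" contain a space.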

### Discrepancy assessment between 1000G and study data
As a quality check that the SNP data subsampled from the 1000 Genomes and study data are the same, I will attempt to merge an arbitrarily selected group from each data set using PLINK. If any errors are found, PLINK will generate an error file. Likely causes of errors would be:

* SNP genomic coordinates not matching
* SNP duplicates found
* SNP strand orientation flipped

If errors are found, the two options for moving forward are
1. Re-run the 10,000 SNP subsampling and hope the SNPs chosen do not raise issues
2. Remove the problematic SNPs

Option #2 is the preferred approach and the one I will be taking.

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing

# Merge AA and AFR genotype files
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --file structure/aa_10k_snp_random_sample \
    --bmerge structure/1000g_data/AFR/AFR_all_autosomes_10k_snp_random_sample \
    --recode \
    --out structure/1000g_AFR_prostate_aa_10k_snp_random_sample

The attempted merge generated 3 errors. When a merge is unsuccessful, the following causes are common:

* Multiple positions found for a variant
* Non-biallelic variants found

When these occur, I exclude variants with multiple positions and see if flipping the non-biallelic variants resolves the second issue.

In [None]:
# Get list of multiple position variants
grep "Multiple positions seen for variant" structure/1000g_AFR_prostate_aa_10k_snp_random_sample.log | \
    cut -d"'" -f2,2 > structure/1000g_AFR_prostate_aa_10k_snp_random_sample.bad_snps.remove

# Flip study AA non-biallelic SNPs and remove multi-position variants
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --file structure/aa_10k_snp_random_sample \
    --exclude structure/1000g_AFR_prostate_aa_10k_snp_random_sample.bad_snps.remove \
    --flip structure/1000g_AFR_prostate_aa_10k_snp_random_sample-merge.missnp \
    --recode \
    --out structure/aa_10k_snp_random_sample_retry

# Remove multi-position variants from 1000G AFR
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --bfile structure/1000g_data/AFR/AFR_all_autosomes_10k_snp_random_sample \
    --exclude structure/1000g_AFR_prostate_aa_10k_snp_random_sample.bad_snps.remove \
    --make-bed \
    --out structure/AFR_all_autosomes_10k_snp_random_sample_retry

# Retry merge
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --file structure/aa_10k_snp_random_sample_retry \
    --bmerge structure/AFR_all_autosomes_10k_snp_random_sample_retry \
    --make-bed \
    --out structure/1000g_AFR_prostate_aa_10k_snp_random_sample_retry

If the retry merge fails, the SNPs that still fail after flipping are combined with the multi-position SNPs into a blacklist that is used when creating the final PED files. Otherwise, a separate blacklist and flip list are used with PLINK `--exclude` and `--flip`.
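
The log-parsing step in the cell above can be tried on a made-up log excerpt. The rsIDs below are invented, and the warning wording is assumed to follow the pattern that the `grep` above searches for.

```shell
workdir=$(mktemp -d)
# Toy log excerpt; the warning text mirrors the pattern grepped for above
cat > "$workdir/toy_merge.log" <<'EOF'
Warning: Multiple positions seen for variant 'rs111'.
Warning: Multiple positions seen for variant 'rs222'.
Error: 1 variant with 3+ alleles present.
EOF

# The variant ID is the text between the first pair of single quotes
grep "Multiple positions seen for variant" "$workdir/toy_merge.log" | \
    cut -d"'" -f2 > "$workdir/bad_snps.remove"

cat "$workdir/bad_snps.remove"
```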

In [None]:
# Create final exclusion list
cat structure/1000g_AFR_prostate_aa_10k_snp_random_sample_retry-merge.missnp structure/1000g_AFR_prostate_aa_10k_snp_random_sample.bad_snps.remove \
> structure/10k_snp_random_sample_blacklist.txt

# Create final flip list if retry merge was successful
cat structure/1000g_AFR_prostate_aa_10k_snp_random_sample_retry-merge.missnp structure/1000g_AFR_prostate_aa_10k_snp_random_sample-merge.missnp | \
sort | uniq -u > structure/10k_snp_random_sample_flip_list.txt

# File cleanup
rm structure/1000g_AFR_prostate_aa_10k_snp_random_sample.*
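
The blacklist and flip-list logic can be illustrated with made-up lists. In the sketch below the file names and rsIDs are invented, and it assumes that every variant failing the retry merge also failed the first merge; under that assumption `uniq -u` leaves exactly the variants that flipping fixed.

```shell
workdir=$(mktemp -d)
# Toy lists: rs1-rs3 failed the first merge; rs2 still fails after flipping
printf 'rs1\nrs2\nrs3\n' > "$workdir/first-merge.missnp"
printf 'rs2\n' > "$workdir/retry-merge.missnp"
printf 'rs9\n' > "$workdir/bad_snps.remove"

# Blacklist: still failing after the flip, plus multi-position variants
cat "$workdir/retry-merge.missnp" "$workdir/bad_snps.remove" | sort -u > "$workdir/blacklist.txt"

# Flip list: IDs that appear in exactly one of the two .missnp files,
# i.e. variants fixed by flipping (uniq -u drops IDs seen twice)
cat "$workdir/retry-merge.missnp" "$workdir/first-merge.missnp" | sort | uniq -u > "$workdir/flip_list.txt"

cat "$workdir/flip_list.txt"
```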

### STRUCTURE input file construction
Because the initial merge and flip test was unsuccessful, I proceed by applying the blacklist filter.

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing/

# Create final ped and map files for study genotype data for SNP subset
ancestry="aa"
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --memory 3000 \
    --bfile ${ancestry}/autosomes/genotypes_b37_dbsnp138_flipped \
    --extract structure/10k_snp_random_sample.txt \
    --exclude structure/10k_snp_random_sample_blacklist.txt \
    --snps-only just-acgt \
    --recode \
    --out structure/${ancestry}_10k_snp_random_sample.final


# Create ped and map files for each 1000G population
for pop in {AFR,EAS,EUR}; do
    /share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --memory 1024 \
    --bfile structure/1000g_data/${pop}/${pop}_all_autosomes_10k_snp_random_sample \
    --exclude structure/10k_snp_random_sample_blacklist.txt \
    --recode \
    --out structure/1000g_data/${pop}_10k_snp_random_sample.final
done

# Final check for SNP discrepancies
/share/nas03/bioinformatics_group/software/plink_1.9_beta3.45/plink \
    --noweb \
    --file structure/aa_10k_snp_random_sample.final \
    --merge structure/1000g_data/AFR_10k_snp_random_sample.final \
    --recode \
    --out structure_input_test

# File cleanup
rm structure_input_test*

No merging issues were identified in the final check, so I will use the script ped2structure.pl to convert the PED file into a STRUCTURE input file format. This script takes two inputs. The first is an integer that serves as an ID to distinguish between a reference panel population or a study data set group. The second input is an integer that is unique to each group/population regardless of whether it's from the study or 1000G data.

The goal of the conversion script is to generate a single STRUCTURE input file containing genotype information for the ~10,000 (post-filtered) subsampled SNPs and the individuals from the study and 1000G data sets. Documentation on the format can be found [here](https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/structure_doc.pdf). The first three columns contain the following information, respectively:
1. Subject identifier
2. Group/population identifier. Distinct for each ancestry group or superpopulation
3. Boolean indicator (1=True, 0=False) specifying reference panel populations. This is used by STRUCTURE to define the ancestry groups

We will be running STRUCTURE assuming that the study subjects descended from three populations. The traditional approach would be to use AFR, EAS, and EUR. I will run STRUCTURE using these 1000G superpopulations.

**Note:** For `K` reference panel populations used for ancestry comparisons, the reference panel populations must be given group IDs between 1 and `K`.
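
As a small illustration of the numbering rule, the sketch below assigns consecutive IDs starting at 1 to the three reference populations. The `assignments` variable is only there for display, and the idea that study groups would take the next free ID is an inference from the note above, not something the conversion script is known to enforce.

```shell
# Assign consecutive group IDs, starting at 1, to the K reference populations
groupID=1
assignments=""
for pop in AFR EAS EUR; do
    assignments="${assignments}${pop}=${groupID} "
    groupID=$((groupID + 1))
done

echo "$assignments"   # study groups would presumably take IDs from $groupID onward
```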

In [None]:
# Command line #
cd /share/nas03/jmarks/QC-test/geneva_prostate/genotype/original/processing
mkdir structure/input_files

#### Create STRUCTURE file with AFR, EAS, and EUR ####


groupID=1 #distinguish between all groups

# Append 1000G populations to STRUCTURE file
truncate -s 0 structure/input_files/input_afr_eas_eur
for pop in {AFR,EAS,EUR}; do
    cat structure/1000g_data/${pop}_10k_snp_random_sample.final.ped | \
    /share/nas03/bioinformatics_group/software/perl/ped2structure.pl 1 ${groupID} \
    >> structure/input_files/input_afr_eas_eur
    groupID=$((groupID + 1))
done
