# Applying Nyholt Correction to chrX SNPs
__Author:__ Jesse Marks <br>
**Date**: September 13, 2018

This documentation is relevant to [GitHub Issue #98](https://github.com/RTIInternational/bioinformatics/issues/98#issuecomment-421154584). To correct for multiple testing in the data, one can apply the [Nyholt Correction Method](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1181954/), a simple multiple-testing correction for SNPs that are in linkage disequilibrium (LD) with each other. Because correlated SNPs are not fully independent tests, the method estimates the effective number of independent SNPs and adjusts the significance threshold to that number.
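For reference, here is a sketch of the estimate as given in the linked paper. The notation is an assumption of this note rather than something the notebook defines: $M$ is the number of SNPs and $\lambda_{\mathrm{obs}}$ are the eigenvalues of the SNP correlation matrix.

$$
M_{\mathrm{eff}} = 1 + (M - 1)\left(1 - \frac{\mathrm{Var}(\lambda_{\mathrm{obs}})}{M}\right)
$$

The two limiting cases show how it behaves. If the SNPs are uncorrelated, every eigenvalue equals 1, the variance is 0, and $M_{\mathrm{eff}} = M$. If the SNPs are perfectly correlated, one eigenvalue equals $M$ and the rest are 0, the (sample) variance is $M$, and $M_{\mathrm{eff}} = 1$.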

We want to run chrX association analyses for HIV acquisition in WIHS1, WIHS2, UHS1+2+3, and VIDUS. A list of chrX SNPs from the UKBioBank GWAS results was significant (listed below). The evidence for these chrX SNPs continues to improve as we combine cohorts in meta-analyses. They are even associated in WIHS2 AAs alone at `P<0.05`. The updated meta-analysis spans WIHS1 and WIHS2 (all available ancestries, total `N=4025`, best `P=0.007`). Interestingly, much of the signal is driven by AAs, which do have the larger sample size.

I will compute the Nyholt correction for this list of SNPs (all on chrX), using the 1000G ALL panel.
```
rs614503:112046351:A:T
rs644416:112048479:C:A
rs5903400:112053941:GA:G
rs4460547:112055287:C:A
rs647000:112055663:C:G
rs5974283:112056081:A:G
rs604591:112064874:A:C
rs17004059:112065240:G:A
rs5973962:112067381:T:C
rs620730:112073165:C:T
rs591816:112073370:A:C
rs685830:112074176:T:C
rs57610141:112075057:G:GA
rs650005:112075127:T:C
```
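The IDs above appear to follow the colon-delimited form used in the 1000 Genomes Phase 3 legend files (rsID, position, first allele, second allele). A minimal sketch of how such an ID can be told apart from a bare rsID and split into its fields is shown here. The function name `classify_id` and the `phase3`/`bare` labels are hypothetical, chosen only for illustration.

```bash
# Hypothetical helper: label an ID as colon-delimited ("phase3") or bare,
# and print the colon-separated fields when they are present.
classify_id() {
    case "$1" in
        *:*) printf '%s\n' "$1" | awk -F: '{ print "phase3", $1, $2, $3, $4 }' ;;
        *)   echo "bare $1" ;;
    esac
}

classify_id 'rs5903400:112053941:GA:G'
classify_id 'rs5903400'
```

Every ID in the list contains a colon, so each one would be treated as already being in the Phase 3 form.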

## Nyholt Correction
Find the effective number of independent SNPs.
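To make the quantity concrete, here is a small sketch that computes two common estimates from a set of eigenvalues. The eigenvalues are made up for illustration. The Nyholt estimate is taken to be `1 + (M - 1) * (1 - Var(eigenvalues) / M)` with the sample variance. The Li and Ji estimate is taken to be the sum over eigenvalues of `(x >= 1) + (x - floor(x))`. The variable name `veffLi` in the cell below suggests that the pipeline reads the latter from the matSpDlite output, but that is an inference from the name.

```bash
# Hypothetical eigenvalues of a 4-SNP correlation matrix (they sum to M = 4).
eigenvalues="2.5 1.0 0.4 0.1"

meff=$(printf '%s\n' $eigenvalues | awk '
    {
        n++; sum += $1; sumsq += $1 * $1
        x = ($1 < 0) ? -$1 : $1            # absolute value of the eigenvalue
        li += (x >= 1) + (x - int(x))      # Li and Ji style contribution
    }
    END {
        var = (sumsq - sum * sum / n) / (n - 1)   # sample variance
        nyholt = 1 + (n - 1) * (1 - var / n)      # Nyholt style estimate
        printf "%.3f %.3f\n", nyholt, li
    }')
echo "$meff"   # prints the Nyholt estimate, then the Li and Ji estimate
```

In a real run the eigenvalues come from the LD correlation matrix of the extracted SNPs, which is what the steps below prepare for matSpDlite.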

In [None]:
## EC2 command line ##
base_dir='/home/ec2-user/jmarks/hiv/wihs1-2_nyholt'
processing_dir='/home/ec2-user/jmarks/hiv/wihs1-2_nyholt/matSpD'
in_data=${base_dir}/'wihs1_snp_list.txt'
out_data=${base_dir}/'wihs2_snp_list.txt'
# matSpD should be in ~/bin

# Copy variants with phase 3 IDs to final file
perl -lane 'if ($F[0] =~ /\:/) { print $F[0]; }' $in_data >> $out_data
# Convert non-phase 3 variant IDs to phase 3 IDs based on name
for variant in $(perl -lane 'if ($F[0] !~ /\:/) { print $F[0]; }' $in_data); do
    chr=$(grep -P "$variant\s" $in_data | perl -lane 'print $F[1];')
    gunzip -c /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz|
        grep "$variant:" |
        perl -lane 'print $F[0];'
done >> $out_data
# Convert non-phase 3 variant IDs to phase 3 IDs based on position
for position in $(perl -lane 'if ($F[0] !~ /\:/) { print $F[2]; }' $in_data); do
    chr=$(grep -P "$position" $in_data | grep -v ":" | perl -lane 'print $F[1];')
    gunzip -c /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr${chr}.legend.gz|
        grep ":$position:"
done > ${base_dir}/variant_list.nonphase3.legend
# Keep one variant ID per position (the first one listed)
perl -lane 'BEGIN { $lastPosition = 0 } if ($F[1] != $lastPosition) { print $F[0]; $lastPosition = $F[1]; }' \
    ${base_dir}/variant_list.nonphase3.legend >> $out_data


### START Extract replication SNPs from 1000G ###
# Extract variants from 1000G panel
#for (( chr=1; chr<23; chr++ )); do
chr=23
    /shared/bioinformatics/software/scripts/qsub_job.sh \
      --job_name ALL_1000G \
      --script_prefix ${base_dir}/wihs1-2_nyholt \
      --mem 10 \
      --priority 0 \
      --program /shared/bioinformatics/software/perl/file_conversion/convert_reference_panels.pl \
          --impute2_hap /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.hap.gz \
          --impute2_legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.legend.gz \
          --impute2_sample /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3.sample \
          --extract $out_data \
          --out ${base_dir}/1000G_ALL.chr$chr.extracted_variants \
          --generate_plink_ped_file \
          --generate_plink_map_file \
          --chr $chr
#done

### END Extract replication SNPs from 1000G ###


### START Generate correlation matrix ###

#for (( chr=1; chr<23; chr++ )); do
chr=23
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --noweb \
        --file ${base_dir}/1000G_ALL.chr$chr.extracted_variants.plink \
        --r \
        --matrix \
        --out ${base_dir}/1000G_ALL.chr$chr.extracted_variants.r
#done

### END Generate correlation matrix ###


### START Run matSpD analysis ###

#for chr in {1..22}; do
chr=23
    echo $chr
    cp ${base_dir}/1000G_ALL.chr$chr.extracted_variants.r.ld ${processing_dir}/correlation.matrix
    cd ${processing_dir}
    R CMD BATCH ${processing_dir}/matSpDlite.R
    mv ${processing_dir}/matSpDlite.out ${base_dir}/1000G_ALL.chr$chr.extracted_variants.matspdlite
#done


veffLi=0
chr=23
#for chr in {1..22}; do
    chrVeffLi=$(grep -A 2 Equation ${base_dir}/1000G_ALL.chr$chr.extracted_variants.matspdlite |\
                tail -n +3 | perl -pe 's/^\s+//; s/\s+$//;')
    echo $chr
    echo $chrVeffLi
    veffLi=$(echo "$veffLi + $chrVeffLi" | bc)
#done
echo $veffLi

# Output (effective number of independent SNPs): 4

### END Run matSpD analysis ###
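
With an effective number of 4 independent SNPs, the per-SNP significance threshold can be adjusted. The sketch below assumes a family-wise level of `alpha=0.05`, which this notebook does not state, and shows both a Bonferroni-style and a Šidák-style adjustment. The variable names are illustrative.

```bash
m_eff=4      # effective number of independent SNPs from the run above
alpha=0.05   # assumed family-wise significance level

# Bonferroni-style threshold: alpha / M_eff
bonferroni=$(awk -v m="$m_eff" -v a="$alpha" 'BEGIN { printf "%.4f", a / m }')
# Sidak-style threshold: 1 - (1 - alpha)^(1 / M_eff)
sidak=$(awk -v m="$m_eff" -v a="$alpha" 'BEGIN { printf "%.6f", 1 - exp(log(1 - a) / m) }')

echo "Bonferroni: $bonferroni  Sidak: $sidak"
```

Under these assumptions a SNP would need a P-value below roughly 0.0125 to remain significant after correction.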
