# [FTND] Minnesota 1df processing
**Author**: Jesse Marks

~~**Date**: July 27, 2018~~ <br>
**Date**: August 10, 2018

**GitHub Issue:** [#105](https://github.com/RTIInternational/bioinformatics/issues/105#issuecomment-412103553)

The following email was sent to Jesse Marks from John Guo
on 20180717. The email subject line reads:<br>
`"[FTND] - Minnesota 1df processing"`

```css
The new Minnesota results are available at 

https://s3.console.aws.amazon.com/s3/buckets/rti-uploads/hyoung/?region=us-east-1&tab=overview

I used script on MIDAS to split the file by chromosome:
/share/nas04/bioinformatics_group/data/studies/minnesota_twins/raw_results/process_raw.20180611.py

Then performed ID conversion, filtering, and plotting using script here:
/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/002/_methods.minnesota_twin.imputed.assoc_tests.001.sh

Could you please move the files to an appropriate location on S3 and process it on AWS?

Please let me know if you have any questions.
```

**Update August 10, 2018**: There were issues with processing the Minnesota Twins version 3 GWAS data. Specifically, the RSQ and PVALUE data were missing. The re-processing of these data is described in the paragraphs below. The updated processing steps and details in this notebook begin in the section titled `Re-processing`.

```
There were three versions of Minnesota Twins results. The first version was
fine except that the results did not contain the imputation quality column
needed to perform the standard RSQ>0.30 filter that is in our GWAS processing
protocol. The second version contained the imputation quality data (RSQ column)
however the PVALUE does not always match exactly with the first version. The
third version is missing both PVALUE and RSQ data. 

Considering the details gathered from email correspondence between Dana Hancock
and Hannah Young, it was decided that we will use the version 1 summary
statistics and append the RSQ column from version 2.
```
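
The decision above (keep the version 1 summary statistics and attach the RSQ column from version 2) amounts to a keyed join. Below is a minimal sketch in Python, assuming each variant can be identified by chromosome, position, and its two alleles. The `REF` and `ALT` column names and the toy rows are hypothetical and are not taken from the Minnesota files; only `CHROM`, `PVALUE`, and `RSQ` are names that appear in this notebook.

```python
def append_rsq(v1_rows, v2_rows, key_cols=("CHROM", "POS", "REF", "ALT"), rsq_col="RSQ"):
    """Return version 1 rows with the RSQ value of the matching version 2 row.

    Rows are dicts keyed by column name. Variants missing from version 2
    are dropped, i.e. only the intersection of the two versions is kept.
    """
    # Look-up table: variant key -> RSQ from version 2
    rsq_by_key = {tuple(row[c] for c in key_cols): row[rsq_col] for row in v2_rows}
    merged = []
    for row in v1_rows:
        key = tuple(row[c] for c in key_cols)
        if key in rsq_by_key:
            out = dict(row)            # keep every version 1 column, PVALUE included
            out[rsq_col] = rsq_by_key[key]
            merged.append(out)
    return merged


# Toy data for illustration only (made-up values)
v1 = [
    {"CHROM": "1", "POS": "100", "REF": "A", "ALT": "G", "PVALUE": "0.5"},
    {"CHROM": "1", "POS": "200", "REF": "C", "ALT": "T", "PVALUE": "0.1"},
]
v2 = [
    {"CHROM": "1", "POS": "100", "REF": "A", "ALT": "G", "PVALUE": "0.49", "RSQ": "0.91"},
]
merged = append_rsq(v1, v2)
print(merged)
```

The PVALUE in the merged row comes from version 1 and only the RSQ comes from version 2, which mirrors the choice described above.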

## Download Data From S3

In [None]:
## EC2 console ##
mkdir -p /shared/data/studies/minnesota_twins/raw_results
cd /shared/data/studies/minnesota_twins/raw_results

aws s3 sync s3://rti-uploads/hyoung/ .

## Split up by chromosome
These GWAS results are all combined in one file. We will split them apart by chromosome first.

In [None]:
import gzip,sys
BASE_DIR = '/shared/data/studies/minnesota_twins/'
for ethnicity in ['ea']:
        inF = gzip.open(BASE_DIR + 'raw_results/MCTFR_FTND.MetaScore.assoc.gz')
        # each time this is called, the next line will be returned.
        line = inF.readline() 
        # keep skipping the lines until the line with the headers appears.
        # Namely, the headers start with one pound sign then CHROM ...
        while(line[:2] == "##"):
                line = inF.readline()

        firstLine = line[1:] # skip the leading pound sign in the header line
        # split returns a list of words (headers)
        chrIndex = firstLine.split().index('CHROM') # returns 0 because CHROM is first header
        lastChr = ''
        line = inF.readline() # go to actual chromosome
        while(line): # while we are not at the end of the file
                # if this line belongs to a different chromosome than the last one processed, start a new output file
                # note that this branch is taken on the very first data line because lastChr = '' initially
                if(line.split()[chrIndex] != lastChr): 
                        fname = 'minnesota_twins.' + ethnicity + '.1000G.chr' + line.split()[chrIndex] + '.' + 'CAT_FTND~1df_add.out.txt'
                        subdir = '001/' + ethnicity + '/' + 'processing/chr' + line.split()[chrIndex] + '/'
                        outF = open(BASE_DIR + 'imputed/v1/association_tests/' + subdir + fname, 'w')
                        # write to a new file based on the new chr we are processing
                        # also add the column Marker to the column header
                        outF.write("Marker" + "\t" + "\t".join(firstLine.split()) + "\n")
                        lastChr = line.split()[chrIndex] # new last chromosome now
                        print('Processing : ' + 'Chr ' + lastChr)
                # creating the Markername = CHR:POSITION
                tmp = line.split()
                outF.write(tmp[0] + ":" + tmp[1] + "\t" + line)
                line = inF.readline() # read the next line
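
To check the splitting logic without touching the cluster paths, here is a minimal in-memory sketch in Python 3. It assumes, as the cell above does, that `##` lines are comments, that the header line starts with a single `#`, and that the first two columns are chromosome and position. The function name and the toy lines are hypothetical.

```python
def split_by_chromosome(lines):
    """Group result lines by chromosome and prefix each row with a CHROM:POS marker.

    Returns a dict mapping chromosome -> list of rows (each row a list of
    fields), where the first row of every chromosome is the header.
    """
    header = None
    per_chrom = {}
    for line in lines:
        if line.startswith("##"):      # skip the comment block
            continue
        fields = line.split()
        if not fields:                 # ignore blank lines
            continue
        if header is None:             # first non-comment line is the header
            header = ["Marker"] + line.lstrip("#").split()
            continue
        chrom, pos = fields[0], fields[1]
        rows = per_chrom.setdefault(chrom, [list(header)])
        rows.append([chrom + ":" + pos] + fields)
    return per_chrom


# Toy input for illustration only (made-up values)
toy_lines = [
    "##comment\n",
    "#CHROM\tPOS\tREF\tALT\tPVALUE\n",
    "1\t100\tA\tG\t0.5\n",
    "1\t200\tC\tT\t0.1\n",
    "2\t50\tG\tA\t0.9\n",
]
result = split_by_chromosome(toy_lines)
for chrom, rows in result.items():
    print(chrom, len(rows) - 1, "variants")
```

Unlike the cell above, this version does not require the input to be sorted by chromosome, at the cost of holding all rows in memory; for the full results file, writing each row straight to a per-chromosome file is the safer choice.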


## ID conversion, filtering, and plotting

In [None]:
for ethnicity in ea; do
  for (( chr=1; chr<23; chr++ )); do
    mkdir -p /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr
  done
  mkdir -p /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/final
done

### Convert ID to phase 3 ID ###
for ethnicity in ea; do
    if [ $ethnicity == "aa" ]
    then
        group=afr
    else
        group=eur
    fi
    for (( chr=1; chr<23; chr++ )); do
        /shared/bioinformatics/software/scripts/qsub_job.sh \
            --job_name MTC_${chr} \
            --script_prefix /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G.chr$chr.CAT_FTND~1df.phase3ID_add.out.txt \
            --mem 15 \
            --nslots 3 \
            --priority 0 \
            --program perl /shared/bioinformatics/software/perl/id_conversion/convert_to_1000g_p3_ids.pl \
            --file_in /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G.chr$chr.CAT_FTND~1df_add.out.txt \
            --file_out /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G.chr$chr.CAT_FTND~1df.phase3ID_add.out.txt \
            --legend /shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chr$chr.legend.gz \
            --file_in_header 1 \
            --file_in_id_col 0 \
            --file_in_chr_col 1 \
            --file_in_pos_col 2 \
            --file_in_a1_col 3 \
            --file_in_a2_col 4 \
            --chr $chr
    done
done

# Check for completion
for ethnicity in ea; do
    if [ $ethnicity == "aa" ]
    then
        group=afr
    else
        group=eur
    fi
    for (( chr=1; chr<23; chr++ ))
    do
        file=/shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G.chr$chr.CAT_FTND~1df.phase3ID_add.out.txt.qsub.log
        if [ -f $file ]
        then
          logLineCount=$(wc -l $file | perl -lane 'print $F[0];')
          if [ $logLineCount -ne 4 ]
          then
            echo $file line count: $logLineCount
          else
            tail -n 1 $file |
              perl -ne 'chomp; if (!/^Done/) { print "'$file'\n".$_."\n"; }'
          fi
        else
          echo $file missing
        fi
    done
done



### START Filter ###
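# create 1000G_p3 MAF filtered ID lists: keep the IDs of variants whose value in column 9 of the legend is >= 0.01
# (the output name says maf_lte_0.01, but these are the variants that pass the filter)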
thouDir=/shared/data/ref_panels/1000G/2014.10
for chr in {1..22};do
    awk ' { if ( $9>=0.01 ) { print $1 } }' <(zcat $thouDir/1000GP_Phase3_chr$chr.legend.gz) >\
        $thouDir/1000GP_Phase3_chr$chr.legend.maf_lte_0.01_eur
done &



thouDir=/shared/data/ref_panels/1000G/2014.10
# MAF > 0.01 in AFR (AA) or EUR (EA)
for ethnicity in ea; do
  if [ $ethnicity == "aa" ]
  then
    group=afr
  else
    group=eur
  fi
  for (( chr=1; chr<23; chr++ )); do
    if [ $chr == "23" ]; then
      idList=/shared/data/ref_panels/1000G/2014.10/1000GP_Phase3_chrX_NONPAR.maf_lte_0.01_$group
    else
      idList=$thouDir/1000GP_Phase3_chr$chr.legend.maf_lte_0.01_eur
    fi
    /shared/bioinformatics/software/scripts/qsub_job.sh \
      --job_name ${ethnicity}_$chr \
      --script_prefix /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G_p3.chr$chr.CAT_FTND~1df.maf_gt_0.01.$group \
      --mem 3.8 \
      --priority 0 \
      --program /shared/bioinformatics/software/perl/utilities/extract_rows.pl \
      --source /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G.chr$chr.CAT_FTND~1df.phase3ID_add.out.txt \
          --id_list $idList \
          --out /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G_p3.chr$chr.CAT_FTND~1df.maf_gt_0.01.$group \
          --header 1 \
          --id_column 0 
  done
done

       
#for ethnicity in ea
#do
#  mv /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr*/*.maf_gt_0.01_??? \
#   /shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/final
#done

# MAF > 0.01 in study
for ethnicity in ea; do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
  for (( chr=1; chr<23; chr++ )); do
    inFile=/shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G_p3.chr$chr.CAT_FTND~1df.maf_gt_0.01.$group
    outFile=${inFile}+minnesota_twins
    echo Processing $inFile
    head -n 1 $inFile > $outFile
    tail -n +2 $inFile |
      perl -lane 'if ($F[6] >= 0.01 && $F[6] <= 0.99) { print; }' >> $outFile
  done
done

# RSQ > 0.3 in study
for ethnicity in ea; do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
  for (( chr=1; chr<23; chr++ )); do
    inFile=/shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G_p3.chr$chr.CAT_FTND~1df.maf_gt_0.01.$group+minnesota_twins
    outFile=${inFile}.RSQ
    echo Processing $inFile
    head -n 1 $inFile > $outFile
    tail -n +2 $inFile |
      perl -lane 'if ($F[17] > 0.3) { print; }' >> $outFile
  done
done


### END Filter ###


### START Generate plots ###
for ethnicity in ea; do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
  for ext in $group ${group}+minnesota_twins ${group}+minnesota_twins.RSQ; do
  outFile=/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/minnesota_twins.$ethnicity.1000G.CAT_FTND~1df.maf_gt_0.01_$ext.table
  echo -e "VARIANT_ID\tCHR\tPOSITION\tP\tTYPE" > $outFile
  for (( chr=1; chr<23; chr++ )); do
    inFile=/shared/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/chr$chr/minnesota_twins.$ethnicity.1000G_p3.chr$chr.CAT_FTND~1df.maf_gt_0.01.$ext
    echo Processing $inFile
    tail -n +2 $inFile |
      perl -lne '/^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+\S+){11}\s+(\S+)/;
                  if (($4 eq "A" || $4 eq "C" || $4 eq "G" || $4 eq "T") && ($5 eq "A" || $5 eq "C" || $5 eq "G" || $5 eq "T")) {
                    print join("\t",$1,$2,$3,$6,"snp");
                  } else {
                    print join("\t",$1,$2,$3,$6,"indel");
                  }' >> $outFile
        done
  done
done
for ethnicity in ea; do
  if [ $ethnicity == "aa" ]; then
    group=afr
  else
    group=eur
  fi
  for ext in $group ${group}+minnesota_twins ${group}+minnesota_twins.RSQ; do
  /shared/bioinformatics/software/scripts/qsub_job.sh \
    --job_name gwas_plots \
    --script_prefix /share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/minnesota_twins.$ethnicity.1000G.CAT_FTND~1df.maf_gt_0.01_$ext.plots \
    --mem 15 \
    --priority 0 \
    --program /share/nas03/bioinformatics_group/software/R/dev/generate_gwas_plots.v6.R \
    --in /share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/minnesota_twins.$ethnicity.1000G.CAT_FTND~1df.maf_gt_0.01_$ext.table \
    --in_chromosomes autosomal_nonPAR \
    --in_header \
    --out /share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/minnesota_twins.$ethnicity.1000G.CAT_FTND~1df.maf_gt_0.01_$ext \
    --col_id VARIANT_ID \
    --col_chromosome CHR \
    --col_position POSITION \
    --col_p P \
    --col_variant_type TYPE \
    --generate_snp_indel_manhattan_plot \
    --manhattan_odd_chr_color red \
    --manhattan_even_chr_color blue \
    --manhattan_points_cex 1.5 \
    --generate_snp_indel_qq_plot \
    --qq_lines \
    --qq_points_bg black \
    --qq_lambda
  done
done

for ethnicity in ea
do
  mv /share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/*.png \
   /share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/final
done

### END Generate plots ###



### START Filter by p-value ###

# MAF > 0.01 in AFR and EUR
for ethnicity in ea
do
  if [ $ethnicity == "aa" ]
  then
    group=afr
  else
    group=eur
  fi
  for ext in $group ${group}_minnesota_twins; do
  outFile=/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/processing/minnesota_twins.$ethnicity.1000G.CAT_FTND~1df.maf_gt_0.01_$ext.p_lte_0.001
  head -n 1 /share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/final/minnesota_twins.$ethnicity.1000G.chr1.CAT_FTND~1df.maf_gt_0.01_$ext > $outFile
  for (( chr=1; chr<23; chr++ ))
  do
    inFile=/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/$ethnicity/final/minnesota_twins.$ethnicity.1000G.chr$chr.CAT_FTND~1df.maf_gt_0.01_$ext
    echo Processing $inFile
    tail -n +2 $inFile |
      perl -lane 'if ($F[16] <= 0.001) { print; }' >> $outFile
  done
  done
done

# Sort
R
inData=read.table("/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/ea/processing/minnesota_twins.ea.1000G.CAT_FTND~1df.maf_gt_0.01_eur.p_lte_0.001", header = TRUE)
inData = inData[order(inData$PVALUE),]
write.csv(inData, file="/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/ea/final/minnesota_twins.ea.1000G.CAT_FTND~1df.maf_gt_0.01_eur.p_lte_0.001.csv", row.names = FALSE)
inData=read.table("/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/ea/processing/minnesota_twins.ea.1000G.CAT_FTND~1df.maf_gt_0.01_eur_minnesota_twins.p_lte_0.001", header = TRUE)
inData = inData[order(inData$PVALUE),]
write.csv(inData, file="/share/nas04/bioinformatics_group/data/studies/minnesota_twins/imputed/v1/association_tests/001/ea/final/minnesota_twins.ea.1000G.CAT_FTND~1df.maf_gt_0.01_eur_minnesota_twins.p_lte_0.001.csv", row.names = FALSE)
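
The two study-level filters applied in the cell above (`$F[6]` between 0.01 and 0.99, then `$F[17]` greater than 0.3) can be expressed as a single predicate. This is a minimal sketch in Python that reuses the zero-based column indices from the perl one-liners; the function name and the toy rows are hypothetical.

```python
def passes_filters(fields, freq_idx=6, rsq_idx=17,
                   freq_min=0.01, freq_max=0.99, rsq_min=0.3):
    """Return True if a result row passes the study MAF and RSQ filters.

    fields   : list of column values for one variant (strings)
    freq_idx : zero-based column holding the allele frequency
    rsq_idx  : zero-based column holding the imputation quality (RSQ)
    """
    freq = float(fields[freq_idx])
    rsq = float(fields[rsq_idx])
    # MAF >= 0.01 is the same as an allele frequency within [0.01, 0.99]
    return freq_min <= freq <= freq_max and rsq > rsq_min


def toy_row(freq, rsq):
    """Build an 18-column toy row (made-up values) with the two columns set."""
    row = ["."] * 18
    row[6] = str(freq)
    row[17] = str(rsq)
    return row


print(passes_filters(toy_row(0.25, 0.8)))
print(passes_filters(toy_row(0.005, 0.8)))
```

The RSQ comparison is strict (`>`), matching the perl filter, so a variant with RSQ exactly 0.3 is dropped.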


# Re-processing
## Download Data From S3

The results for version1 & version2 are located on AWS S3 at `rti-nd/Minnesota/FTND_1df_1000G_p3/20180402` & `rti-nd/Minnesota/FTND_1df_1000G_p3/20180611`, respectively. 

In [None]:
cd /shared/bioinformatics/jmarks/nicotine/gwas/minnesota_twins/data/

aws s3 sync s3://rti-nd/Minnesota/FTND_1df_1000G_p3/20180402 20180402 
aws s3 sync s3://rti-nd/Minnesota/FTND_1df_1000G_p3/20180611 20180611

#unzip data
gzip -dr *

# Check the count of the number of variants in both files
tail -n +39 20180402/*c | wc -l
"""9458862"""

tail -n +39 20180611/*c | wc -l
"""9458936"""

# Since there is a discrepancy between the two versions of the data
# (version2 has 74 more variants), I will need to take the intersection
# of the two. We will see if we can identify which SNPs are unique to each version.

tail -n +39 20180402/*c | awk  'BEGIN{OFS=":"} {print $1,$2,$3,$4}' > chrom+pos+A1+A2.version1 &
tail -n +39 20180611/*c | awk  'BEGIN{OFS=":"} {print $1,$2,$3,$4}' > chrom+pos+A1+A2.version2 &
wait # both key files must be complete before they are compared

# lines unique to version1 (run in the foreground so that wc sees the finished file)
comm -23 <(sort chrom+pos+A1+A2.version1) <(sort chrom+pos+A1+A2.version2) > 20180402/variants_exclusive_to_version1
wc -l 20180402/variants_exclusive_to_version1
"""2"""

# lines unique to version2
comm -13 <(sort chrom+pos+A1+A2.version1) <(sort chrom+pos+A1+A2.version2) > 20180611/variants_exclusive_to_version2
wc -l 20180611/variants_exclusive_to_version2
"""76"""
# lines that are in both
mkdir 20180810
comm -12 <(sort chrom+pos+A1+A2.version1) <(sort chrom+pos+A1+A2.version2) > 20180810/variants_in_both_versions
wc -l 20180810/variants_in_both_versions
"""9458860 20180810/variants_in_both_versions""" # note this is what we would expect

## now we just need to subset each file by the variants in the intersection
#awk ' NR==FNR{map[$1]++;next} map[$1]' variants_in_both_versions test

# get headers
tail -n +38 20180402/*c | head -n1 > 20180810/version1_subset
tail -n +38 20180611/*c | head -n1 > 20180810/version2_subset

# subset files (run in the foreground so that wc sees the finished output)
awk ' NR==FNR{map[$1]++; next} { if ( ($1":"$2":"$3":"$4) in map) {print $0}}' 20180810/variants_in_both_versions \
    20180402/*c >> 20180810/version1_subset
wc -l 20180810/version1_subset

awk ' NR==FNR{map[$1]++; next} { if ( ($1":"$2":"$3":"$4) in map) {print $0}}' 20180810/variants_in_both_versions \
    20180611/*c >> 20180810/version2_subset
wc -l 20180810/version2_subset
# double check that lines match
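
The `comm` calls above are set operations on the `chrom:pos:A1:A2` keys. Here is a minimal sketch in Python of the same comparison, assuming each key occurs at most once per version (`comm` works on lines, whereas a Python `set` drops duplicates). The function name and the toy keys are hypothetical; the counts in the final check are the ones reported in this notebook.

```python
def compare_versions(v1_keys, v2_keys):
    """Split variant keys into those unique to each version and those shared."""
    v1, v2 = set(v1_keys), set(v2_keys)
    return {
        "only_v1": v1 - v2,   # comm -23
        "only_v2": v2 - v1,   # comm -13
        "both": v1 & v2,      # comm -12
    }


# Toy keys for illustration only (made-up variants)
toy_v1 = ["1:100:A:G", "1:200:C:T", "2:50:G:A"]
toy_v2 = ["1:100:A:G", "2:50:G:A", "2:75:T:C", "3:10:A:C"]
parts = compare_versions(toy_v1, toy_v2)
print({name: sorted(keys) for name, keys in parts.items()})

# Consistency check on the counts reported above: total minus exclusive
# variants should equal the size of the intersection for both versions.
reported = {"v1_total": 9458862, "v1_only": 2,
            "v2_total": 9458936, "v2_only": 76, "both": 9458860}
v1_ok = reported["v1_total"] - reported["v1_only"] == reported["both"]
v2_ok = reported["v2_total"] - reported["v2_only"] == reported["both"]
print(v1_ok, v2_ok)
```

The same arithmetic gives the expected line count of each subset file: 9458860 variants plus one header line.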
