# 1000 Genomes data preparation

Data from the 1000 Genomes Project must be processed before they can be used for population stratification. The dataset is provided after QC, so there is no need to run these first steps yourself.

To complete this task, you need the bfile 'dataset.b37.IBD' generated in the previous tutorial (Task_2_QC).

In [18]:
%env path=/mnt/data/GWAS/output/task2.2_stratification
%env intpath=/mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets
%env ref_files=/mnt/data/GWAS/ref_files/Phase3_v5

env: path=/mnt/data/GWAS/output/task2.2_stratification
env: intpath=/mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets


In [11]:
%%bash 
# Get a list of SNPs from your working dataset
awk '{print $2}' /mnt/data/GWAS/output/task2_QC/intermediate_datasets/dataset.b37.IBD.bim > $path/mySNPlist.txt
head $path/mySNPlist.txt

1:10177
1:11008
1:11012
1:13110
rs201725126
rs200579949
1:13273
1:14464
1:14599
1:14604
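
The list above mixes rsIDs with `chr:pos` style identifiers. Since `--snps` in vcftools selects sites by their ID, it can help to know how many IDs of each style the list holds and whether any ID is repeated. The following is a minimal sketch on a toy list; the temporary file and the variable names (`snplist`, `n_rsid`, `n_chrpos`, `n_dup`) are illustrative assumptions, not part of the tutorial.

```bash
# Toy SNP list (hypothetical), mimicking the mixed ID styles shown above
snplist=$(mktemp)
printf '%s\n' 1:10177 1:11008 rs201725126 rs200579949 1:13273 1:10177 > "$snplist"

# Count rsIDs, chr:pos style IDs, and IDs that occur more than once
n_rsid=$(grep -c '^rs[0-9][0-9]*$' "$snplist")
n_chrpos=$(grep -c '^[0-9XYMT]*:[0-9][0-9]*$' "$snplist")
n_dup=$(sort "$snplist" | uniq -d | wc -l | tr -d ' ')

echo "rsIDs: $n_rsid, chr:pos IDs: $n_chrpos, duplicated IDs: $n_dup"
rm -f "$snplist"
```

To run the same counts on the real list, point `snplist` at `$path/mySNPlist.txt` and drop the `printf` and `rm` lines.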


In [14]:
%%bash
# Create a recoded VCF per chromosome from the compressed 1000 Genomes VCF, keeping only the SNPs in the list above.
# PLINK then converts each VCF to a bfile, and the intermediate VCF is deleted.
for i in {1..22}
do
vcftools --gzvcf $ref_files/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.clean.vcf.gz --snps $path/mySNPlist.txt --recode --out $path/1KG.vcf.chr$i
plink --vcf $path/1KG.vcf.chr$i.recode.vcf --make-bed --out $path/1KG.chr$i
rm $path/1KG.vcf.chr$i.recode.vcf
done

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/task2.2_stratification//1KG.chr1.log.
Options in effect:
  --make-bed
  --out /mnt/data/GWAS/output/task2.2_stratification//1KG.chr1
  --vcf /mnt/data/GWAS/output/task2.2_stratification//1KG.vcf.chr1.recode.vcf

32127 MB RAM detected; reserving 16063 MB for main workspace.
--vcf: /mnt/data/GWAS/output/task2.2_stratification//1KG.chr1-temporary.bed +
/mnt/data/GWAS/output/task2.2_stratification//1KG.chr1-temporary.bim +
/mnt/data/GWAS/output/task2.2_stratification//1KG.chr1-temporary.fam written.
497923 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/task2.2_stratification//1KG.chr1.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2504 founders and 0 nonfo


VCFtools - UNKNOWN
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /mnt/data/GWAS/input/Phase3_v5/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.clean.vcf.gz
	--out /mnt/data/GWAS/output/task2.2_stratification//1KG.vcf.chr1
	--recode
	--snps /mnt/data/GWAS/output/task2.2_stratification//mySNPlist.txt

Using zlib version: 1.2.8
After filtering, kept 2504 out of 2504 Individuals
Outputting VCF file...
After filtering, kept 497923 out of a possible 6467265 Sites
Run Time = 610.00 seconds

VCFtools - UNKNOWN
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--gzvcf /mnt/data/GWAS/input/Phase3_v5/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.clean.vcf.gz
	--out /mnt/data/GWAS/output/task2.2_stratification//1KG.vcf.chr2
	--recode
	--snps /mnt/data/GWAS/output/task2.2_stratification//mySNPlist.txt

Using zlib version: 1.2.8
After filtering, kept 2504 out of 2504 Individuals
Outputting VCF file...
Aft

In [16]:
%%bash
# Create a file listing the bfiles (.bed, .bim, .fam) output in the previous step,
# one fileset per line, for chromosomes 2-22 (chromosome 1 is passed via --bfile).
# Redirecting the whole loop overwrites the list, so re-running does not duplicate entries.
for i in {2..22}
do
echo $path/1KG.chr$i.bed $path/1KG.chr$i.bim $path/1KG.chr$i.fam
done > $path/1KG.outlist.txt
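
Before merging, it may be worth confirming that every file named in the merge list actually exists, since a per-chromosome conversion that failed would otherwise only surface during the merge. The sketch below builds a toy list in a temporary directory and checks it; `workdir` is a hypothetical stand-in for `$path`, and the range 2 to 4 stands in for 2 to 22.

```bash
workdir=$(mktemp -d)          # hypothetical stand-in for $path
for i in 2 3 4; do            # toy range; the tutorial uses 2..22
  touch "$workdir/1KG.chr$i.bed" "$workdir/1KG.chr$i.bim" "$workdir/1KG.chr$i.fam"
done

# Write one fileset (.bed .bim .fam) per line
for i in 2 3 4; do
  echo "$workdir/1KG.chr$i.bed $workdir/1KG.chr$i.bim $workdir/1KG.chr$i.fam"
done > "$workdir/1KG.outlist.txt"

# Count files that are named in the list but absent on disk
missing=0
while read -r bed bim fam; do
  for f in "$bed" "$bim" "$fam"; do
    [ -e "$f" ] || { echo "missing: $f"; missing=$((missing + 1)); }
  done
done < "$workdir/1KG.outlist.txt"

n_lines=$(wc -l < "$workdir/1KG.outlist.txt" | tr -d ' ')
echo "filesets listed: $n_lines, missing files: $missing"
rm -rf "$workdir"
```

On the real data, only the `while` loop is needed, reading from `$path/1KG.outlist.txt`.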

In [17]:
%%bash
# Merge all bfiles into a single one
plink --bfile $path/1KG.chr1 --make-bed --merge-list $path/1KG.outlist.txt --out $path/1KG.all

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/task2.2_stratification//1KG.all.log.
Options in effect:
  --bfile /mnt/data/GWAS/output/task2.2_stratification//1KG.chr1
  --make-bed
  --merge-list /mnt/data/GWAS/output/task2.2_stratification//1KG.outlist.txt
  --out /mnt/data/GWAS/output/task2.2_stratification//1KG.all

32127 MB RAM detected; reserving 16063 MB for main workspace.
Performing single-pass merge (2504 people, 6505760 variants).
Merged fileset written to                     
/mnt/data/GWAS/output/task2.2_stratification//1KG.all-merge.bed +
/mnt/data/GWAS/output/task2.2_stratification//1KG.all-merge.bim +
/mnt/data/GWAS/output/task2.2_stratification//1KG.all-merge.fam .
6505760 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/task2.2_
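
A quick way to confirm that the merged fileset contains every chromosome is to count variants per chromosome in the `.bim` file, whose first column is the chromosome code. This is a sketch on a toy `.bim`; the variant IDs `snpA` to `snpC`, the positions, and the alleles are made up for illustration.

```bash
bim=$(mktemp)
# Toy .bim columns: chromosome, variant ID, genetic distance, position, allele 1, allele 2
cat > "$bim" <<'EOF'
1 1:10177 0 10177 C A
1 snpA 0 13116 G T
2 snpB 0 10587 A G
22 snpC 0 16050075 G A
EOF

# Variants per chromosome, sorted numerically by chromosome code
per_chr=$(awk '{n[$1]++} END {for (c in n) print c, n[c]}' "$bim" | sort -n)
n_chr=$(awk '{print $1}' "$bim" | sort -u | wc -l | tr -d ' ')
n_variants=$(wc -l < "$bim" | tr -d ' ')

echo "$per_chr"
echo "chromosomes: $n_chr, variants: $n_variants"
rm -f "$bim"
```

For the merged dataset, set `bim` to `$path/1KG.all.bim` (and skip the here-document and `rm`); 22 chromosomes would be expected if all per-chromosome files were merged.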

In [32]:
%%bash
# Remove the per-chromosome PLINK files for chromosomes 2-22 (the chromosome 1 files are left in place)
for i in {2..22}
do
rm $path/1KG.chr$i.*
done

## QC

The 1000 Genomes dataset produced by this QC is provided (1kG.QCed.forMDS), so there is no need to run these steps.

In [28]:
%%bash
# Remove variants with a missing genotype rate above 5%.
plink --bfile $path/1KG.all --geno 0.05 --allow-no-sex --make-bed --out $intpath/1kG_MDS

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS.log.
Options in effect:
  --allow-no-sex
  --bfile /mnt/data/GWAS/output/task2.2_stratification/1KG.all
  --geno 0.05
  --make-bed
  --out /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS

32127 MB RAM detected; reserving 16063 MB for main workspace.
6505760 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS.nosex
.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2504 founders and 0 nonfounders present.
Calculating allele frequencies...
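
To make the `--geno 0.05` filter concrete: it drops variants whose missing call rate exceeds 0.05. The next cell loads 6500188 variants, 5572 fewer than the 6505760 loaded here, which is consistent with this filter. The sketch below reproduces the idea on a toy table with awk; the table layout (variant ID followed by one genotype per sample, `NA` for a missing call) is an assumption made for illustration and is not a PLINK file format.

```bash
geno=$(mktemp)
# Hypothetical toy table: variant ID, then one genotype per sample ("NA" = missing call)
cat > "$geno" <<'EOF'
snpA 0 1 2 1
snpB 0 NA 2 1
snpC NA NA 0 1
EOF

threshold=0.05
# Per variant: missing rate = missing calls / number of samples.
# Print the variants whose rate exceeds the threshold (these would be dropped).
removed_variants=$(awk -v t="$threshold" '{
  miss = 0
  for (i = 2; i <= NF; i++) if ($i == "NA") miss++
  if (miss / (NF - 1) > t) print $1
}' "$geno")

echo "$removed_variants"
rm -f "$geno"
```

With only four samples, a single missing call already gives a rate of 0.25, so both `snpB` and `snpC` are flagged.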

In [29]:
%%bash
# Remove individuals with a missing genotype rate above 3%.
plink --bfile $intpath/1kG_MDS --mind 0.03 --allow-no-sex --make-bed --out $intpath/1kG_MDS2

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS2.log.
Options in effect:
  --allow-no-sex
  --bfile /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS
  --make-bed
  --mind 0.03
  --out /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS2

32127 MB RAM detected; reserving 16063 MB for main workspace.
6500188 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS2.nosex
.
0 people removed due to missing genotype data (--mind).
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2504 founders and 0 nonfounders present.
Calculating allele frequencies...
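
The `--mind 0.03` filter is the per-individual counterpart of `--geno`: it drops samples whose missing call rate exceeds 0.03 (the log above reports that no individuals were removed). A minimal sketch of the per-sample calculation follows, using a hypothetical toy table (variant ID followed by one genotype per sample, `NA` for a missing call) and made-up sample labels `sample1` to `sample4`.

```bash
geno=$(mktemp)
# Hypothetical toy table: rows are variants, columns 2..5 are samples
cat > "$geno" <<'EOF'
snpA 0 1 2 1
snpB 0 NA 2 1
snpC NA NA 0 1
EOF

threshold=0.03
# Per sample (column): missing rate = missing calls / number of variants.
# Print the samples whose rate exceeds the threshold (these would be dropped).
removed_samples=$(awk -v t="$threshold" '{
  for (i = 2; i <= NF; i++) { total[i]++; if ($i == "NA") miss[i]++ }
  ncol = NF
} END {
  for (i = 2; i <= ncol; i++) if (miss[i] / total[i] > t) print "sample" (i - 1)
}' "$geno")

echo "$removed_samples"
rm -f "$geno"
```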

In [33]:
%%bash
# Remove variants with a minor allele frequency (MAF) below 5%.
plink --bfile $intpath/1kG_MDS2 --maf 0.05 --allow-no-sex --make-bed --out $intpath/1kG_MDS3

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS3.log.
Options in effect:
  --allow-no-sex
  --bfile /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS2
  --maf 0.05
  --make-bed
  --out /mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS3

32127 MB RAM detected; reserving 16063 MB for main workspace.
6500188 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
/mnt/data/GWAS/output/task2.2_stratification/intermediate_datasets/1kG_MDS3.nosex
.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2504 founders and 0 nonfounders present.
Calculating allele frequencies...
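
The `--maf 0.05` filter drops variants whose minor allele frequency is below 0.05. As a rough illustration of that calculation, the sketch below assumes a toy table in which each genotype is coded as the number of copies of one allele (0, 1, or 2); this coding and the variant names are assumptions for the example, not PLINK's internal representation.

```bash
geno=$(mktemp)
# Hypothetical toy table: variant ID, then allele counts (0/1/2) for five samples
cat > "$geno" <<'EOF'
snpA 0 0 0 0 0
snpB 0 1 0 2 1
snpC 2 2 1 2 2
snpD 2 2 2 2 2
EOF

threshold=0.05
# Allele frequency = sum of allele counts / (2 * non-missing samples);
# MAF is the smaller of that frequency and its complement.
low_maf=$(awk -v t="$threshold" '{
  sum = 0; n = 0
  for (i = 2; i <= NF; i++) if ($i != "NA") { sum += $i; n++ }
  if (n == 0) next
  f = sum / (2 * n)
  maf = (f < 1 - f) ? f : 1 - f
  if (maf < t) print $1
}' "$geno")

echo "$low_maf"
rm -f "$geno"
```

Here `snpA` and `snpD` are monomorphic (MAF of 0) and are flagged, while `snpB` (0.4) and `snpC` (0.1) pass.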

**For the next step you need the following files:**
- 1kG_MDS3 (the bfile, i.e., 1kG_MDS3.bed, 1kG_MDS3.bim, and 1kG_MDS3.fam)
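
A small helper can confirm that all three components of a bfile are present and non-empty before moving on. The function name `check_bfile` and the temporary demo directory are hypothetical; this is a sketch, not part of the tutorial's pipeline.

```bash
# Return 0 if prefix.bed, prefix.bim and prefix.fam all exist and are non-empty
check_bfile() {
  local prefix=$1 ext status=0
  for ext in bed bim fam; do
    if [ ! -s "$prefix.$ext" ]; then
      echo "missing or empty: $prefix.$ext" >&2
      status=1
    fi
  done
  return "$status"
}

# Demo on placeholder files: the fileset is incomplete until the .fam is added
demo=$(mktemp -d)
printf 'x' > "$demo/1kG_MDS3.bed"
printf 'x' > "$demo/1kG_MDS3.bim"
if check_bfile "$demo/1kG_MDS3"; then before=complete; else before=incomplete; fi
printf 'x' > "$demo/1kG_MDS3.fam"
if check_bfile "$demo/1kG_MDS3"; then after=complete; else after=incomplete; fi

echo "before: $before, after: $after"
rm -rf "$demo"
```

In the tutorial's environment the call would be `check_bfile "$intpath/1kG_MDS3"`.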
