# Task_1: Preparing the dataset.

The first task of the protocol prepares a working dataset in PLINK v1.9 binary format, with all SNPs identified by rs number and coordinates based on genome build GRCh37/hg19, as required by the Michigan Imputation Server.

Unmapped variants and variants with an uncertain location are removed.

Run this template if you need to change the build; otherwise you can skip this and go to task 2. 

*These steps are much quicker in bash than in R, even though some steps can take a while as the input dataset is usually big.*

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#UCSC--LiftOver.-Change-the-genomic-assembly-to-build-hg19/GRCh37-(requiered-for-imputation-at-Michigan-Server)" data-toc-modified-id="UCSC--LiftOver.-Change-the-genomic-assembly-to-build-hg19/GRCh37-(requiered-for-imputation-at-Michigan-Server)-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>UCSC LiftOver. Change the genomic assembly to build hg19/GRCh37 (required for imputation at Michigan Server)</a></span></li><li><span><a href="#Update-the-database" data-toc-modified-id="Update-the-database-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Update the database</a></span></li><li><span><a href="#Check-for-duplicates" data-toc-modified-id="Check-for-duplicates-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Check for duplicates</a></span></li><li><span><a href="#Update-to-rs-to-obtain-Final-DB-for-QC" data-toc-modified-id="Update-to-rs-to-obtain-Final-DB-for-QC-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Update to rs to obtain Final DB for QC</a></span></li></ul></div>

In [2]:
%load_ext rpy2.ipython

In [3]:
%env Path=/mnt/data/GWAS/output/task1_preQC

env: Path=/mnt/data/GWAS/output/task1_preQC


## UCSC LiftOver. Change the genomic assembly to build hg19/GRCh37 (required for imputation at Michigan Server)

In [4]:
%%bash
# Update b36 to b37: first inspect the input .bim file (build 36)
head /mnt/data/GWAS/input/dataset.b36.bim

1	1:107	0	107	T	TA
1	1:177	0	177	AC	A
1	1:350	0	350	T	A
1	1:355	0	355	G	A
1	1:462	0	462	GT	G
1	1:871	0	871	G	C
1	1:875	0	875	G	C
1	1:2973	0	2973	A	G
1	1:2979	0	2979	G	T
1	1:2981	0	2981	G	A


In [5]:
%%bash
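# Build a UCSC BED file (chr, start, end, SNP ID) from the .bim.
# PLINK chromosome codes 23, 24 and 26 are renamed to chrX, chrY and chrM.
awk '{OFS="\t"; print "chr"$1,$4,$4+1,$2}' /mnt/data/GWAS/input/dataset.b36.bim | sed 's/chr23/chrX/g' | sed 's/chr24/chrY/g' | sed 's/chr26/chrM/g' > $Path/UCSC_b36.bed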
head $Path/UCSC_b36.bed

chr1	107	108	1:107
chr1	177	178	1:177
chr1	350	351	1:350
chr1	355	356	1:355
chr1	462	463	1:462
chr1	871	872	1:871
chr1	875	876	1:875
chr1	2973	2974	1:2973
chr1	2979	2980	1:2979
chr1	2981	2982	1:2981
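As a minimal sketch of the conversion above, the following applies the same column logic to a two-line toy `.bim` file. The function name `bim_to_ucsc_bed` and the toy variants are hypothetical and not part of the original notebook. Unlike the `sed` chain, this version renames 23/24/26 only in the chromosome column.

```shell
# Hypothetical helper (not part of the original notebook): convert a PLINK
# .bim file (chr, SNP ID, cM, bp, allele 1, allele 2) to 4-column BED.
bim_to_ucsc_bed() {
    awk 'BEGIN { OFS = "\t"; name["23"] = "X"; name["24"] = "Y"; name["26"] = "M" }
         {
             # Rename PLINK numeric codes to UCSC names; leave autosomes as they are.
             chr = ($1 in name) ? name[$1] : $1
             print "chr" chr, $4, $4 + 1, $2
         }' "$1"
}

# Toy input with made-up variants, tab-separated like a real .bim.
toy_bim=$(mktemp)
printf '1\t1:107\t0\t107\tT\tTA\n23\t23:5000\t0\t5000\tG\tA\n' > "$toy_bim"
bim_to_ucsc_bed "$toy_bim"
rm -f "$toy_bim"
```

The fourth BED column carries the original SNP ID through liftOver, which is what lets the later cells match old and new coordinates.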


In [10]:
%%bash
# UCSC liftOver converts the coordinates to the target build. It writes two output files: hglft_genome.bed and unmapped.bed
# 1st parameter: input BED file
# 2nd parameter: chain file (fixed path, do not change it)
# 3rd parameter: path to the output file with the lifted coordinates
# 4th parameter: path to the output file with the unmapped SNPs
/mnt/data/GWAS/tools/liftOver $Path/UCSC_b36.bed /mnt/data/GWAS/ref_files/hg18ToHg19.over.chain.gz $Path/hglft_genome.bed $Path/unmapped.bed

Reading liftover chains
Mapping coordinates
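A quick way to see how many variants were lifted and how many were not is to count the records in the two output files. The sketch below assumes that liftOver writes the reason for each failure to `unmapped.bed` as a line starting with `#`, so those lines are skipped. The helper name `count_bed_records` and the toy files are hypothetical.

```shell
# Hypothetical helper: count BED records, skipping '#' comment lines and blank lines.
count_bed_records() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Toy files standing in for hglft_genome.bed and unmapped.bed.
toy_mapped=$(mktemp)
toy_unmapped=$(mktemp)
printf 'chr1\t10107\t10108\t1:107\nchr1\t10177\t10178\t1:177\n' > "$toy_mapped"
printf '#Deleted in new\nchr4\t3964514\t3964515\t4:3964514\n' > "$toy_unmapped"
echo "mapped:   $(count_bed_records "$toy_mapped")"
echo "unmapped: $(count_bed_records "$toy_unmapped")"
rm -f "$toy_mapped" "$toy_unmapped"
```

The two counts should add up to the number of lines in the input BED file.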


In [6]:
%%bash
head $Path/hglft_genome.bed

chr1	10107	10108	1:107
chr1	10177	10178	1:177
chr1	10350	10351	1:350
chr1	10355	10356	1:355
chr1	10462	10463	1:462
chr1	11008	11009	1:871
chr1	11012	11013	1:875
chr1	13110	13111	1:2973
chr1	13116	13117	1:2979
chr1	13118	13119	1:2981


**Note: the hglft_genome.bed file is not the same as in the previous execution!**


## Update the database

Update the chromosome, the base-pair position and the name of each SNP.

In [8]:
%%bash
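# SNP ID -> new chromosome (X, Y and M are converted back to PLINK codes 23, 24 and 26)
awk '{OFS="\t"; print $4,$1}' $Path/hglft_genome.bed | sed 's/chrX/chr23/g' | sed 's/chrY/chr24/g' | sed 's/chrM/chr26/g'| sed 's/chr//g'  > $Path/update_chr.txt
# SNP ID -> new base-pair position
awk '{OFS="\t"; print $4,$2}' $Path/hglft_genome.bed > $Path/update_bp.txt
# SNP ID -> new name in chr:position format
awk '{OFS="\t"; print $4,$1":"$2}' $Path/hglft_genome.bed | sed 's/chr//g' > $Path/update_name.txt
# List of the new chr:position names, used below to look up the rs numbers
awk '{OFS=""; print $1,":",$2}' $Path/hglft_genome.bed  | sed 's/chr//g' > $Path/UCSC_b37.txt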
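To make the layout of the three update files concrete, here is a minimal single-pass sketch that writes them from a toy lifted BED file. It is an illustration under stated assumptions, not the notebook's method: the function name `make_update_files` and the toy coordinates are hypothetical, and the output file names simply mirror those used above.

```shell
# Hypothetical helper: write update_chr.txt, update_bp.txt and update_name.txt
# into directory $2 from the lifted BED file $1 (chr, start, end, old SNP ID).
make_update_files() {
    local bed="$1" out="$2"
    awk -v out="$out" 'BEGIN { OFS = "\t"; code["X"] = 23; code["Y"] = 24; code["M"] = 26 }
    {
        chr = $1
        sub(/^chr/, "", chr)
        name = chr ":" $2                  # new ID keeps X/Y/M, as in the cell above
        if (chr in code) chr = code[chr]   # PLINK numeric code for the chromosome column
        print $4, chr  > (out "/update_chr.txt")
        print $4, $2   > (out "/update_bp.txt")
        print $4, name > (out "/update_name.txt")
    }' "$bed"
}

# Toy lifted BED file with made-up coordinates.
toy_dir=$(mktemp -d)
printf 'chr1\t10107\t10108\t1:107\nchrX\t60001\t60002\t23:1\n' > "$toy_dir/lifted.bed"
make_update_files "$toy_dir/lifted.bed" "$toy_dir"
cat "$toy_dir/update_chr.txt" "$toy_dir/update_bp.txt" "$toy_dir/update_name.txt"
rm -rf "$toy_dir"
```

Each file has two columns, the old SNP ID followed by the new value, which is the layout PLINK's `--update-chr`, `--update-map` and `--update-name` read by default.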

In [26]:
%%bash
split -l 3349530 --numeric-suffixes $Path/UCSC_b37.txt  $Path/UCSC_b37_
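The chunk size passed to `split -l` above is fixed. As a hedged alternative, the chunk size can be derived from the number of lines so that at most a chosen number of chunks is produced. The helper name `split_into_chunks` is hypothetical, and `--numeric-suffixes` assumes GNU coreutils `split`.

```shell
# Hypothetical helper: split file $1 into at most $3 chunks named ${2}00, ${2}01, ...
split_into_chunks() {
    local ids="$1" prefix="$2" n="$3" total lines
    total=$(( $(wc -l < "$ids") ))
    [ "$total" -gt 0 ] || return 0          # nothing to split
    lines=$(( (total + n - 1) / n ))        # ceiling of total / n
    split -l "$lines" --numeric-suffixes "$ids" "$prefix"
}

# Toy ID list with five made-up chr:position names, split into three chunks.
toy_dir=$(mktemp -d)
printf '1:%s\n' 101 102 103 104 105 > "$toy_dir/ids.txt"
split_into_chunks "$toy_dir/ids.txt" "$toy_dir/ids_" 3
wc -l "$toy_dir"/ids_0*
rm -rf "$toy_dir"
```

With five lines and three chunks, the chunk size is two lines, so the last chunk holds the single remaining line.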

In [9]:
%%bash
# Look up each chunk of chr:position names in the reference file, then merge the results.
# grep -F is the current spelling of the obsolescent fgrep.
grep -Fwf $Path/UCSC_b37_00 /mnt/data/GWAS/ref_files/1000G_to_rs_dbSNP37_Phase3_single_rs > $Path/UCSC_b37_rs_00.txt
grep -Fwf $Path/UCSC_b37_01 /mnt/data/GWAS/ref_files/1000G_to_rs_dbSNP37_Phase3_single_rs > $Path/UCSC_b37_rs_01.txt
grep -Fwf $Path/UCSC_b37_02 /mnt/data/GWAS/ref_files/1000G_to_rs_dbSNP37_Phase3_single_rs > $Path/UCSC_b37_rs_02.txt
cat $Path/UCSC_b37_rs_00.txt $Path/UCSC_b37_rs_01.txt $Path/UCSC_b37_rs_02.txt > $Path/UCSC_b37_rs.txt
rm $Path/UCSC_b37_rs_0*

Process is terminated.
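The lookup above matches fixed strings as whole words. An alternative sketch, swapped in here for illustration only, is an exact key lookup in `awk`: load the chr:position names into a hash, then print the reference lines whose first column is in it. It assumes the reference file has two whitespace-separated columns (chr:position, rs number), which is what its later use with `--update-name` suggests. The function name `lookup_rs` and the rs numbers in the toy data are made up.

```shell
# Hypothetical helper: print "chr:pos<TAB>rs" for every ID in $1 found in column 1 of $2.
lookup_rs() {
    local ids="$1" ref="$2"
    awk 'FILENAME == ARGV[1] { want[$1] = 1; next }
         ($1 in want)        { print $1 "\t" $2 }' "$ids" "$ref"
}

# Toy ID list and toy reference (made-up rs numbers).
toy_dir=$(mktemp -d)
printf '1:10177\n1:10350\n' > "$toy_dir/ids.txt"
printf '1:10177 rs111\n11:10177 rs222\n1:10352 rs333\n' > "$toy_dir/ref.txt"
lookup_rs "$toy_dir/ids.txt" "$toy_dir/ref.txt"
rm -rf "$toy_dir"
```

Because the comparison is on the whole first column, `1:10177` cannot match `11:10177`, and the ID list does not need to be split into chunks.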


Prepare the list of unmapped SNPs for PLINK.

In [68]:
%%bash
awk '{OFS="\t"; print $4}' $Path/unmapped.bed | sed '/^$/d' > $Path/rs_to_exclude
head $Path/rs_to_exclude
wc $Path/rs_to_exclude

4:3964514
7:130526
7:142154515
8:17394388
8:86068608
 5  5 53 /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/rs_to_exclude
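A slightly more defensive version of the extraction above drops the comment lines before taking the fourth column. This is a sketch under the assumption that liftOver annotates each unmapped record with a `#` comment line such as `#Deleted in new` or `#Partially deleted in new`; a four-word comment would otherwise contribute its last word to the list. The function name `unmapped_ids` and the toy records are hypothetical.

```shell
# Hypothetical helper: list the SNP IDs (4th column) of the records in an unmapped.bed file.
unmapped_ids() {
    grep -v '^#' "$1" | awk 'NF >= 4 { print $4 }'
}

# Toy unmapped.bed with two comment lines and two records.
toy_unmapped=$(mktemp)
printf '#Deleted in new\nchr4\t3964514\t3964515\t4:3964514\n' >  "$toy_unmapped"
printf '#Partially deleted in new\nchr7\t130526\t130527\t7:130526\n' >> "$toy_unmapped"
unmapped_ids "$toy_unmapped"
rm -f "$toy_unmapped"
```

The resulting one-ID-per-line list is the format `--exclude` expects.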


Start pruning with PLINK: remove the unmapped SNPs, then update the chromosome (`--update-chr`), the base-pair position (`--update-map`) and the SNP name (`--update-name`).

In [69]:
%%bash
# Find plink logs in the Jupyter File view, in this working directory
/usr/lib/plink1.9/plink --bfile $Path/dataset.b36 --exclude $Path/rs_to_exclude --update-chr $Path/update_chr.txt --allow-extra-chr --make-bed --zero-cms --out $Path/temp1
/usr/lib/plink1.9/plink --bfile $Path/temp1 --update-map $Path/update_bp.txt --allow-extra-chr  --make-bed --out $Path/temp2
/usr/lib/plink1.9/plink --bfile $Path/temp2 --update-name $Path/update_name.txt --allow-extra-chr --chr 1-26 --make-bed --out $Path/temp3

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp1.log.
Options in effect:
  --allow-extra-chr
  --bfile /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/dataset.b36
  --exclude /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/rs_to_exclude
  --make-bed
  --out /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp1
  --update-chr /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/update_chr.txt
  --zero-cms

257718 MB RAM detected; reserving 128859 MB for main workspace.
551154 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
503 phenotype values loaded from .fam.
--exclude: 551149 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 503 founders and 0 nonfounders pr

/mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp1.hh ); many
commands treat these as missing.
/mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp2.hh ); many
commands treat these as missing.
/mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp3.hh ); many
commands treat these as missing.
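The `--allow-extra-chr` and `--chr 1-26` options above suggest that some variants may land on contigs outside the standard PLINK codes after liftOver. A minimal sketch for checking this is to tally the non-standard chromosome values in `update_chr.txt`. The function name `nonstandard_contigs` is hypothetical, and the contig name in the toy data is only an example of what such a value might look like.

```shell
# Hypothetical helper: tally chromosome values in column 2 that are not integers in 1..26.
nonstandard_contigs() {
    awk '$2 !~ /^[0-9]+$/ || $2 < 1 || $2 > 26 { n[$2]++ }
         END { for (c in n) print c "\t" n[c] }' "$1" | sort
}

# Toy update_chr.txt: old SNP ID, new chromosome.
toy_chr=$(mktemp)
printf '1:107\t1\n6:100\t6_cox_hap2\n6:200\t6_cox_hap2\n23:5\t23\n' > "$toy_chr"
nonstandard_contigs "$toy_chr"
rm -f "$toy_chr"
```

An empty result means every lifted variant sits on chromosomes 1 to 26, so `--chr 1-26` will not drop anything.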


## Check for duplicates

In [70]:
%%bash
# List the SNP IDs that appear more than once in the .bim file (they are removed in the next cell)
awk '{print $2}' $Path/temp3.bim | sort | uniq -d > $Path/remove_duplicates.txt
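The duplicate check can be tried on a toy `.bim` file as follows. This is a minimal sketch: the function name `duplicated_ids` and the toy variants are hypothetical, and `sort | uniq -d` is used here as an equivalent of counting occurrences and keeping the IDs seen more than once.

```shell
# Hypothetical helper: print each SNP ID (column 2 of a .bim) that occurs more than once.
duplicated_ids() {
    awk '{ print $2 }' "$1" | sort | uniq -d
}

# Toy .bim in which 1:10177 appears twice.
toy_bim=$(mktemp)
printf '1\t1:10107\t0\t10107\tT\tTA\n'  >  "$toy_bim"
printf '1\t1:10177\t0\t10177\tAC\tA\n' >> "$toy_bim"
printf '1\t1:10177\t0\t10177\tA\tC\n'  >> "$toy_bim"
duplicated_ids "$toy_bim"
rm -f "$toy_bim"
```

Since `--exclude` works on variant IDs, every copy of a listed ID is dropped, not just the extra ones.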

In [71]:
%%bash
/usr/lib/plink1.9/plink --bfile $Path/temp3 --exclude $Path/remove_duplicates.txt --make-bed --out $Path/temp4

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp4.log.
Options in effect:
  --bfile /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp3
  --exclude /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/remove_duplicates.txt
  --make-bed
  --out /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp4

257718 MB RAM detected; reserving 128859 MB for main workspace.
551149 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
503 phenotype values loaded from .fam.
--exclude: 551149 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 503 founders and 0 nonfounders present.
Calculating allele frequencies...

/mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp4.hh ); many
commands treat these as missing.


## Update to rs to obtain Final DB for QC


In [73]:
%%bash
/usr/lib/plink1.9/plink --bfile $Path/temp4 --update-name $Path/UCSC_b37_rs.txt --make-bed --out $Path/dataset.b37
rm $Path/temp*

PLINK v1.90b3.45 64-bit (13 Jan 2017)      https://www.cog-genomics.org/plink2
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/dataset.b37.log.
Options in effect:
  --bfile /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/temp4
  --make-bed
  --out /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/dataset.b37
  --update-name /mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/UCSC_b37_rs.txt

257718 MB RAM detected; reserving 128859 MB for main workspace.
551149 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
503 phenotype values loaded from .fam.
--update-name: 496363 values updated.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 503 founders and 0 nonfounders present.
Calculating allele frequencies...

/mnt/Almacen6/Adapted/jupyterAnalisis_others/Genomic_pipeline/dataset.b37.hh );
many commands treat these as missing.
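The log above reports 496363 names updated out of 551149 variants, so the remaining 54786 variants keep their chr:position name. A minimal sketch for confirming this split on the final `.bim` file is shown below on toy data. The function name `count_id_styles`, the rs numbers and the toy variants are all made up.

```shell
# Hypothetical helper: count SNP IDs (column 2 of a .bim) that are rs numbers versus anything else.
count_id_styles() {
    awk '$2 ~ /^rs[0-9]+$/ { rs++; next } { other++ }
         END { printf "rs\t%d\nother\t%d\n", rs, other }' "$1"
}

# Toy .bim: two variants renamed to rs numbers, one still named chr:position.
toy_bim=$(mktemp)
printf '1\trs111\t0\t10177\tAC\tA\n'   >  "$toy_bim"
printf '1\trs222\t0\t10352\tT\tTA\n'   >> "$toy_bim"
printf '1\t1:10107\t0\t10107\tT\tTA\n' >> "$toy_bim"
count_id_styles "$toy_bim"
rm -f "$toy_bim"
```

Run against `dataset.b37.bim`, the two counts give a quick check of how many variants will enter QC without an rs number.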


**My notes:** this is Mariu's original notebook; I am not running it again. It stays as the final version.