# Using PLINK to run a GWAS analysis

### Toy data received

Toy data from UK Biobank with 10,000 individuals: unrel_10k_EUR
Alzheimer's disease phenotypes created following the method from the Jansen et al. paper: ukb_alz.pheno
Multiple covariates (sex, location): ukb_alz.covs

### PLINKing

The following command extracts a single chromosome from the data

--bfile: tells PLINK the prefix of the binary fileset (.bed, .bim, .fam)

--chr: lets you choose the chromosome

--make-bed: writes a new set of bed, bim, fam files

In [14]:
%%
plink --bfile unrel_10k_EUR --chr 2 --make-bed --out 2

Extracting only a few selected SNPs to do some testing

In [1]:
%%
plink --bfile unrel_10k_EUR --snps 10:68564:A_G-10:73537:A_G, 10:82187:C_G, 10:85499:A_G --make-bed --out TESTING

Changing the access rights to a folder

In [15]:
%%
chmod +rwx alz_cc.plink

Looking at the first lines of an interesting *gzipped* file

In [16]:
%%
zcat ukb_alz.pheno.gz | head

Copying a file from one folder to another

In [17]:
%%
scp -r alz_matthieu/ukb_alz.pheno.gz unrel_10K_EUR/

Running PLINK on the phenotypes (in this case only for chromosome 1)

--pheno: asks for a phenotype file

--pheno-name: lets you specify the name of the phenotype you want to analyse

--assoc: simple association test (important: it does not use covariates, and gives no error message about it!)



In [None]:
%%
plink --bfile 1 --pheno ukb_alz.pheno --pheno-name alz_wt --assoc --allow-no-sex --out test_1

Looping over the chromosomes instead of running the command manually for each one

In [18]:
%%
for chr in {1..23}; do
    plink --bfile $chr --pheno ukb_alz.pheno --pheno-name alz_wt \
        --assoc --allow-no-sex --out test_${chr}
done
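
Before launching 23 runs, it can help to check what the loop will actually execute. The sketch below is a dry run under that assumption: it only builds and prints the command for each chromosome and never calls PLINK. The `cmds` array is a name introduced here for illustration.

```shell
#!/bin/bash
# Dry run: collect the PLINK command for each chromosome instead of running it.
cmds=()
for chr in {1..23}; do
    cmds+=("plink --bfile $chr --pheno ukb_alz.pheno --pheno-name alz_wt --assoc --allow-no-sex --out test_${chr}")
done

# Print one command per line so the file names and prefixes can be inspected.
printf '%s\n' "${cmds[@]}"
```

Once the printed commands look right, the real loop can be run as is.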

Running PLINK on the full toy data

In [19]:
%%
plink --bfile unrel_10k_EUR --pheno ukb_alz.pheno --pheno-name alz_wt \
    --assoc --allow-no-sex --out test_full_data

Keeping only the columns of interest for a Manhattan plot. The output columns are
CHR (1, chromosome), SNP (2), BP (3, base-pair position), P (4, p-value, taken from column 9 of the .qassoc file)

In [20]:
%%
awk '{if (NR>1) print $1, $2, $3,$9}' test_full_data.qassoc | grep -v NA > plot.full_data.txt
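
To see what this column selection does, here is a minimal sketch on a toy file laid out like a `.qassoc` file. All values are made up for illustration, and `toy.qassoc` and `plot.toy.txt` are hypothetical names. Instead of `grep -v NA`, the sketch tests the p-value column itself, which avoids dropping rows whose SNP or file name merely contains the letters "NA".

```shell
#!/bin/bash
# Toy stand-in for a PLINK .qassoc file (all values made up for illustration).
workdir=$(mktemp -d)
cat > "$workdir/toy.qassoc" <<'EOF'
 CHR        SNP     BP  NMISS    BETA     SE      R2      T       P
   1  1:100:A_G    100   9990   0.012  0.010  0.0001   1.20  0.2301
   1  1:250:C_T    250   9985      NA     NA      NA     NA      NA
   2  2:300:G_T    300   9991  -0.034  0.011  0.0009  -3.09  0.0020
EOF

# Skip the header (NR > 1), keep CHR, SNP, BP and P (column 9),
# and drop rows whose p-value is literally "NA".
awk 'NR > 1 && $9 != "NA" { print $1, $2, $3, $9 }' \
    "$workdir/toy.qassoc" > "$workdir/plot.toy.txt"

cat "$workdir/plot.toy.txt"
```

Only the two rows with a real p-value remain, each reduced to the four columns the plot needs.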

Unzipping

In [21]:
%%
gunzip ukb_alz.covs.gz

When a run takes too long,
create a job file (for example tutojob)

In [23]:
%%
nano tutojob

Fill the file with the following commands plus the job you want to run

--linear: runs a linear regression (again, --assoc is not possible here since we are using covariates)

--pheno: asks for the phenotype file to be analysed

--covar: asks for the covariate file

--memory: the default memory may be too small (here raised from 2 GB to 4 GB)

--threads: use only one thread per PLINK process, so that the loop can spread the chromosomes over multiple cores

--maf: filters out variants with a minor allele frequency below the threshold (here 0.01)

"&": runs each iteration of the loop as a background process



In [25]:
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 75:00:00
#SBATCH --output=job%j.o
#SBATCH --error=job%j.e
cd $TMPDIR
for i in {1..16}; do
(
cp $HOME/toydata/unrel_10K_EUR/$i* ./
)
done
cp $HOME/toydata/unrel_10K_EUR/ukb_alz* ./
module load 2019
module load PLINK/2.00-alpha1-x86_64

for i in {1..16}; do
(
plink2 --bfile $i --pheno ukb_alz.pheno --pheno-name alz_lin \
    --linear --covar ukb_alz.covs --covar-name sex, f.21022.0.0, \
    assesscentre11004, assesscentre11005, assesscentre11006, assesscentre11007, \
    assesscentre11008, assesscentre11009, assesscentre11010, assesscentre11011, \
    assesscentre11012, assesscentre11013, assesscentre11014, assesscentre11016, \
    assesscentre11017, assesscentre11018, assesscentre11020, assesscentre11021, \
    assesscentre11022, assesscentre11023, pop_pc1, pop_pc2, pop_pc3, pop_pc4, \
    pop_pc5, pop_pc6, pop_pc7, pop_pc8, pop_pc9, pop_pc10, pop_pc11, pop_pc12, \
    pop_pc13, pop_pc14, pop_pc15, pop_pc16, pop_pc17, pop_pc18, pop_pc19, pop_pc20,\
    pop_pc21, pop_pc22, pop_pc23, pop_pc24, pop_pc25, pop_pc26, pop_pc27, pop_pc28, \
    pop_pc29, pop_pc30 \
    --allow-no-sex --memory 4000 --threads 1 --maf .01\
    --out FINAL_$i
)&
done
wait
cp FINAL* $HOME/toydata/unrel_10K_EUR/
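
The `( ... ) &` plus `wait` pattern in the job script can be tried without PLINK. The sketch below is only an illustration: `run_chromosome` is a hypothetical stand-in that sleeps briefly and writes a small log file, and the loop covers four "chromosomes" instead of sixteen.

```shell
#!/bin/bash
# Hypothetical stand-in for one PLINK run: writes a small result file.
workdir=$(mktemp -d)
run_chromosome() {
    local chr=$1
    sleep 0.1   # pretend to do some work
    echo "chromosome $chr done" > "$workdir/FINAL_$chr.log"
}

# Start one background subshell per chromosome, as in the job script.
for i in {1..4}; do
    (
        run_chromosome "$i"
    ) &
done

# Block until every background job has finished before using the results.
wait
n_results=$(ls "$workdir" | wc -l)
echo "finished runs: $n_results"
```

Without the `wait`, the script would reach the final copy step while some runs are still writing their output.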


In [None]:
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 75:00:00
#SBATCH --output=job%j.o
#SBATCH --error=job%j.e
cd $TMPDIR
for i in {17..23}; do
(
cp $HOME/toydata/unrel_10K_EUR/$i* ./
)
done
cp $HOME/toydata/unrel_10K_EUR/ukb_alz* ./
module load 2019
module load PLINK/2.00-alpha1-x86_64

for i in {17..23}; do
(
plink2 --bfile $i --pheno ukb_alz.pheno --pheno-name alz_lin \
    --linear --covar ukb_alz.covs --covar-name sex, f.21022.0.0, \
    assesscentre11004, assesscentre11005, assesscentre11006, assesscentre11007, \
    assesscentre11008, assesscentre11009, assesscentre11010, assesscentre11011, \
    assesscentre11012, assesscentre11013, assesscentre11014, assesscentre11016, \
    assesscentre11017, assesscentre11018, assesscentre11020, assesscentre11021, \
    assesscentre11022, assesscentre11023, pop_pc1, pop_pc2, pop_pc3, pop_pc4, \
    pop_pc5, pop_pc6, pop_pc7, pop_pc8, pop_pc9, pop_pc10, pop_pc11, pop_pc12, \
    pop_pc13, pop_pc14, pop_pc15, pop_pc16, pop_pc17, pop_pc18, pop_pc19, pop_pc20,\
    pop_pc21, pop_pc22, pop_pc23, pop_pc24, pop_pc25, pop_pc26, pop_pc27, pop_pc28, \
    pop_pc29, pop_pc30 \
    --allow-no-sex --memory 4000 --threads 1 --maf .01\
    --out FINAL_$i
)&
done
wait
cp FINAL* $HOME/toydata/unrel_10K_EUR/

Run your job

In [26]:
%%
sbatch tutojob

See where your jobs are using your username (here matthieu); you will be able to see the job ID

In [28]:
%%
squeue -u matthieu

Cancel your job

In [29]:
%%
scancel [jobid]

Script to concatenate multiple files together (here multiple chromosomes)
-h tells grep not to prefix each matching line with the file name, so the files can be concatenated (cat) cleanly

In [1]:
%%
head -1 FINAL_23.alz_wt.glm.linear > head.txt # Take the headers and put them in a separate file
grep -h ADD FINAL_*.alz_wt.glm.linear > temp.txt 
cat head.txt temp.txt > data.txt
rm temp.txt head.txt
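
The same steps can be tried on two toy result files to check that the header appears once and that no file name leaks into the rows. This is a sketch with made-up values; the `TOY_*.glm.linear` files are hypothetical and have fewer columns than real PLINK output.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two toy result files (made-up values) sharing the same header.
printf '#CHROM\tPOS\tID\tTEST\tP\n1\t100\trs1\tADD\t0.5\n1\t100\trs1\tsex\t0.9\n' > TOY_1.glm.linear
printf '#CHROM\tPOS\tID\tTEST\tP\n2\t200\trs2\tADD\t0.01\n2\t200\trs2\tsex\t0.8\n' > TOY_2.glm.linear

head -1 TOY_1.glm.linear > head.txt        # take the header once
grep -h ADD TOY_*.glm.linear > temp.txt    # -h: no "filename:" prefix on matches
cat head.txt temp.txt > data.txt
rm temp.txt head.txt

cat data.txt
```

The result has one header line followed by the ADD row of each file; the covariate rows are left out.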

Count the lines of a file

In [2]:
%%
wc -l <file>

Extract only the data that interests us (Chrom, SNP, BP, P)

(The extra clean-up steps are only needed if you did not use the -h option of grep)

In [2]:
%%
#data.txt | cut -f2 -d"L" | head
awk '{if (NR>1) print $1, $3, $2, $12}' data.txt | grep -v NA > data2.txt
# If, like me, you had the good idea of giving your files a name that contains NA (FINAL),
# cut off the start of the file-name prefix first, otherwise grep -v NA removes every line
awk '{if (NR>1) print $1, $3, $2, $12}' data.txt | cut -f2 -d"L" | grep -v NA > data2.txt
# Replace the rest of the file-name prefix (.alz_wt.glm.linear:) with a marker (M)
sed 's/.alz_wt.glm.linear:/M/' data2.txt > data3.txt
# Drop everything up to the marker so that each line starts with the chromosome
awk '{print $1, $2, $3, $4}' data3.txt | cut -f2 -d"M" > data.txt
rm data2.txt data3.txt
# Change the X chromosome into the integer 23
sed 's/X/23/' data.txt > data2.txt
rm data.txt
# Sort the file numerically (-n)
sort -n data2.txt > data.txt

In [None]:
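%%
# The right way to do it, finally...
awk '{if (NR>1) print $1, $3, $2, $12}' data.txt | grep -v NA > data2.txt
# Recode chromosome X as 23 (only when X is the whole first field)
sed 's/^X /23 /' data2.txt > dataX.txt
# Replace <pheno> with the name of the phenotype
sort -n dataX.txt > data_<pheno>.txt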
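
A plain `sort -n` compares only the leading number of each line, so within one chromosome the rows end up ordered as text rather than by position. If a strict chromosome-then-position order is wanted, one possible variant is sketched below on made-up rows; `toy.txt` and `toy_sorted.txt` are hypothetical names.

```shell
#!/bin/bash
workdir=$(mktemp -d)
# Toy "CHR SNP BP P" rows (made-up values), deliberately out of order.
cat > "$workdir/toy.txt" <<'EOF'
X rsX1 500 0.04
2 rs2b 900 0.20
10 rs10 50 0.03
2 rs2a 100 0.50
EOF

# Recode X only when it is the whole first field, then sort numerically
# by chromosome (field 1) and by position (field 3).
awk '$1 == "X" { $1 = 23 } { print }' "$workdir/toy.txt" \
    | sort -k1,1n -k3,3n > "$workdir/toy_sorted.txt"

cat "$workdir/toy_sorted.txt"
```

Chromosome 2 comes before 10, the two chromosome 2 rows are ordered by position, and the former X row ends up last as 23.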

If you want to have allele frequencies (here for chromosome 10)



In [1]:
%%
plink2 --bfile 10 --freq --out 10


## RStudio code

In [5]:
%%
library(qqman) # Library for Manhattan plot
library("data.table") # Fast aggregation of large data
library("readr") # Easy read of rectangular data
data <- fread("data2.txt", head=FALSE)
colnames(data)<-c("CHR", "SNP","BP","P") # Put a header on the data
a = subset(data, data$P<0.001) # Keep only the SNPs with a low p-value (P < 0.001)
jpeg("manhattan2.jpeg")
manhattan(a,chr="CHR",bp="BP",p="P",snp="SNP", main = "plot")
dev.off()

# JOBS Commands

In [1]:
%%
sbatch [job] # Submit your job
squeue -u [user] # Check the status of your jobs
squeue | more # Check the whole queue, page by page
scancel [jobid] # Cancel a job

### Check where the job is at

In [2]:
%%
slurm_joblogin <jobid>
cd /scratch/slurm.<jobid>/scratch

# Build Genotype Matrix



Run a job that uses the bed, fam, bim files to build the raw genotype matrix

--recode A: recodes bed, fam, bim into an additive matrix of allele counts (0, 1, 2), written to a .raw file


In [1]:
%%
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 75:00:00
#SBATCH --output=job%j.o
#SBATCH --error=job%j.e
cd $TMPDIR
cp $HOME/toydata/unrel_10K_EUR/10* ./

module load 2019
module load PLINK/2.00-alpha1-x86_64
plink2 --bfile 10 --recode A --out chromo_1
wait
cp chromo_1* $HOME/toydata/unrel_10K_EUR/


Download the matrix locally


In [3]:
%%

scp -r matthieu@lisa.surfsara.nl:toydata/unrel_10K_EUR/chromo_1.raw .
    

# Matlab


Read matrix in matlab


In [3]:
%%
M2 = readmatrix("minitoyfreq.txt")
size(M2) % Check the size of the matrix


Reshape the matrix into a row vector: [1 2 3; 4 5 6; 7 8 9] -> [1 2 3 4 5 6 7 8 9]


In [4]:
%%

M3 = reshape(M2.',1,[])
size(M3) % Check the size of the array (1xn)

Make a matrix of repeated identical rows [M3; M3; M3; ...]


In [5]:
%%

M4 = repmat(M3, 100, 1)

Save variable in a file


In [6]:
%%

save("./chromo_1.mat","X",'-v7.3')


Reading a ped file in matlab


In [None]:
%%
% Creates a cell array with one cell per line of the file

M = regexp(fileread('toy20.txt'), '\r?\n', 'split');


Split every element of the array into the identifier part (first 22 characters) and the rest of the line


In [None]:
%%
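indiv = {};
Y = {};
n = numel(M); % Do not name a variable "size": it would shadow the built-in function
for i = 1:n-1 % The last cell is skipped (empty after a trailing newline)
    a = M{i};
    indiv{end+1} = a(1:22);
    Y{end+1} = a(23:end);
end
X = cellfun(@(v)v(1),Y) % First character of each remaining string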

save("./chromo_1.mat","X",'-v7.3')