# Getting Started

Linux users can open a terminal window with:

In [None]:
Ctrl+Alt+T

Log in to rocket with your account:

In [None]:
ssh your_account_id@rocket.hpc.ut.ee

### Good practices when starting a new project

It is good practice to keep your directory clean and well-organized. 

For example, since today we will work on the non-imputed dataset, we can create the directory `non_imputed` right away with the command `mkdir` (mkdir = make directory). This is where we will store the dataset and run the analyses.

In [None]:
mkdir non_imputed

To move into the newly created directory, you can use the `cd` command (cd = change directory).

In [None]:
cd non_imputed

There, you can immediately create two directories: `dataset`, where we will store the dataset, and `analyses`, where we will run our tests.

In [None]:
mkdir dataset 
mkdir analyses
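The same layout can also be created in a single step. This is a minimal sketch, assuming you start one level above `non_imputed`; `project_root` is a hypothetical scratch directory used here so that the example does not touch your real home directory.

```shell
# project_root is a hypothetical scratch directory for this demo;
# on the cluster you would run the mkdir line from your home directory instead.
project_root=$(mktemp -d)
cd "$project_root"

# -p creates missing parent directories and does not complain if they already exist
mkdir -p non_imputed/dataset non_imputed/analyses

# list what was created inside non_imputed
ls non_imputed
```

Because of `-p`, the command is safe to run again if the directories are already there.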

### Get the dataset
The VCF is readily available at /gpfs/helios/home/etais/hpc_lm_echo/AADR_dataset/non_imputed_set.vcf. Make sure that you are currently in your `non_imputed/dataset` folder (run `cd dataset` if you are still in `non_imputed`), then copy the file into your directory this way:

In [None]:
cp /gpfs/helios/home/etais/hpc_lm_echo/AADR_dataset/non_imputed_set.vcf ./
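Before converting the file, it can be useful to peek at its structure. The sketch below counts meta-information lines, variant records and samples in a VCF. It runs on a tiny made-up file (`vcf`, with invented records) so that it is self-contained; on the cluster you would point the same commands at `non_imputed_set.vcf`.

```shell
# Toy stand-in for non_imputed_set.vcf (made-up records, for illustration only)
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\n' >> "$vcf"
printf '1\t1000\trs_toy1\tA\tG\t.\t.\t.\tGT\t0/1\t1/1\n' >> "$vcf"
printf '1\t2000\trs_toy2\tC\tT\t.\t.\t.\tGT\t./.\t0/0\n' >> "$vcf"

# Meta-information lines start with "##"
n_meta=$(grep -c '^##' "$vcf")

# Variant records are all lines that do not start with "#"
n_variants=$(grep -vc '^#' "$vcf")

# The "#CHROM" header has 9 fixed columns; every further column is one sample
n_samples=$(awk -F'\t' '/^#CHROM/ {print NF - 9; exit}' "$vcf")

echo "meta lines: $n_meta, variants: $n_variants, samples: $n_samples"
```

These counts should agree with what PLINK reports later when it loads the same file.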

## Explore the Dataset

We are working on a remote server, where multiple users have different software needs. Managing many software packages in many versions is complex and can cause dependency issues.

Rather than having all software installed and readily available, we use **modules**. Modules allow us to manage and load software environments dynamically, ensuring that the correct versions of software and their dependencies are used without conflicts.

Before **loading** a specific module, we need to search for it, to know whether it is available and under what name. 

`module spider` will list all available software packages, and their versions, matching the searched name.

`module spider plink`

`module load` is used to activate a specific software environment by loading the desired module. 

`module load plink/1.9-beta6.27`

# Dataset seen through PLINK

In [None]:
module spider plink
module load plink/1.9-beta6.27

Convert the VCF to PLINK binary format (.bed, .bim, .fam):

In [None]:
plink --vcf non_imputed_set.vcf --make-bed --out non_imputed_set

While converting, PLINK informs us about:
 * Number of SNPs (variants)
 * Number of individuals (people) and number of males/females
 * Genotyping rate

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to non_imputed_set.log.
Options in effect:
  --make-bed
  --out non_imputed_set
  --vcf non_imputed_set.vcf

31510 MB RAM detected; reserving 15755 MB for main workspace.
--vcf: non_imputed_set-temporary.bed + non_imputed_set-temporary.bim +
non_imputed_set-temporary.fam written.
597573 variants loaded from .bim file.
523 people (0 males, 0 females, 523 ambiguous) loaded from .fam.
Ambiguous sex IDs written to non_imputed_set.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 523 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.743685.
597573 variants and 523 people pass filters and QC.
Note: No phenotypes present.
--make-bed to non_imputed_set.bed + non_imputed_set.bim + non_imputed_set.fam
... done.

## Estimating the number of SNPs and samples

We can easily count the number of SNPs by counting the lines in the bim file with bash `wc -l`.

`wc` = word count \
`-l` = lines

In [None]:
wc -l non_imputed_set.bim 

Similarly, we can count the number of samples by counting the lines in the fam file:

In [None]:
wc -l non_imputed_set.fam

If the FIDs (family/cluster IDs, first column of the fam file) are available, we can count the number of samples per cluster:

In [None]:
awk '{print $1}' non_imputed_set.fam | sort | uniq -c

In [None]:
      3 Albanian.HO
     28 Basque.HO
      4 Belarusian.HO
      2 Bulgarian.HO
      4 Croatian.HO
      2 Cypriot.HO
      3 Czech.HO
     15 Denmark.Zealand.SM
     34 Eng.EM
      1 Eng.Kent.ASEM
      2 English.HO
      3 Eng.Norfolk.ASEM
     84 Eng.S
      5 Eng.Scorton.Anglian
      1 Eng.Suffolk.ASEM
      9 Estonian.HO
      5 Finnish.HO
     14 French.HO
     15 Germany.AltInden.SEM
     11 Germany.Anderten.SM
     16 Germany.Drantum.SM
     11 Germany.Dunum.SM
      4 Germany.Hiddestorf.SEM
      2 Germany.Issendorf.SEM
      4 Germany.Liebenau.SEM
     15 Germany.Schleswig.SLM
      4 Germany.Schortens.SEM
      2 Germany.Zetel.SEM
     10 Greek.HO
     11 Hungarian.HO
     10 Icelandic.HO
     38 Ire.Kilteasheen.ASEM
     11 Italian.North.HO
      1 Italian.Sardinian.HO
      7 Lithuanian.HO
      5 Maltese.HO
      2 Ned.Friesland.SM
     15 Ned.Groningen.SM
      4 Norwegian.HO
     12 Orcadian.HO
      4 Romanian.HO
     20 Russian.HO
     22 Sardinian.HO
      3 Scottish.HO
      4 Sicilian.HO
     32 Spanish.HO
      5 Spanish.North.HO
      4 Ukrainian.HO
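The per-cluster counts are easier to scan when ranked by size. This is a small sketch on a made-up fam file (`fam`, with invented sample IDs and counts); the only change from the command above is a final numeric, reversed `sort` on the count column.

```shell
# Toy fam file: FID IID father mother sex phenotype (made-up samples)
fam=$(mktemp)
cat > "$fam" <<'EOF'
Greek.HO toy1 0 0 0 -9
Basque.HO toy2 0 0 0 -9
Basque.HO toy3 0 0 0 -9
Cypriot.HO toy4 0 0 0 -9
Greek.HO toy5 0 0 0 -9
Basque.HO toy6 0 0 0 -9
EOF

# Count samples per FID, then rank clusters from largest to smallest
ranking=$(awk '{print $1}' "$fam" | sort | uniq -c | sort -k1,1nr)
echo "$ranking"

# Name of the largest cluster (second field of the first ranked line)
largest=$(echo "$ranking" | head -n 1 | awk '{print $2}')
echo "Largest cluster: $largest"
```

On the real data you would replace `"$fam"` with `non_imputed_set.fam`.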

## Allele Frequencies
A dataset can be characterized by many rare or many common variants. The allele frequencies will affect the study and the kind of analyses we can carry out on the dataset. To get an insight into the allele frequencies in our dataset, we can use the `--freq` option in PLINK, which writes a minor allele frequency report to an output **.frq** file.

In [None]:
plink --bfile non_imputed_set --freq --out allele_freq_report

And we can see the first few lines with the command `head`

In [None]:
head allele_freq_report.frq

The file has 6 columns:

* CHR	Chromosome number
* SNP	Variant identifier
* A1	Allele 1 (usually minor)
* A2	Allele 2 (usually major)
* MAF	Allele 1 frequency (MAF stands for Minor Allele Frequency)
* NCHROBS	Number of non-missing allele observations (at most number of samples*2)


### How many variants have a MAF lower than 0.05?

In [None]:
awk '$5 < 0.05 {print $5}' allele_freq_report.frq | wc -l
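To put that count in context, the same `awk` pass can also report the proportion of rare variants. The sketch below uses a made-up `.frq` file (`frq`, with invented values) and a hypothetical threshold variable `t`; it skips the header line and any variant whose MAF is reported as `NA`.

```shell
# Toy .frq file with the same six columns as the PLINK report (made-up values)
frq=$(mktemp)
cat > "$frq" <<'EOF'
 CHR      SNP   A1   A2    MAF  NCHROBS
   1  rs_toy1    A    G   0.01      100
   1  rs_toy2    C    T   0.25       98
   1  rs_toy3    G    A     NA        0
   2  rs_toy4    T    C   0.04       96
EOF

# t is the MAF threshold; NR > 1 skips the header, NA rows are ignored
maf_summary=$(awk -v t=0.05 '
  NR > 1 && $5 != "NA" { total++; if ($5 < t) rare++ }
  END { printf "%d of %d variants (%.1f%%) have MAF < %s", rare, total, 100 * rare / total, t }
' "$frq")

echo "$maf_summary"
```

On the real report you would replace `"$frq"` with `allele_freq_report.frq`.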

We can remove them with the PLINK option `--maf`, which keeps only the variants with a MAF of at least the given threshold.

In [None]:
plink --bfile non_imputed_set --maf 0.05 --make-bed --out non_imputed_set_maf05

## Evaluating the level of missingness in our dataset

In [None]:
plink --bfile non_imputed_set --missing --out missingness_check

PLINK will create two output files with the prefix 'missingness_check':
* missingness_check.imiss
* missingness_check.lmiss

We can check the first ten lines of each file with the command `head`:

### missingness_check.imiss

In [None]:
head missingness_check.imiss

**i**miss stands for individual missingness; the file summarizes the missing genotype rates per individual.
* FID contains the sample cluster/family name
* IID contains the sample individual ID
* MISS_PHENO indicates whether the phenotype information is missing (Y/N)
* N_MISS	Number of missing genotype call(s)
* N_GENO	Number of potentially valid call(s)
* F_MISS	Missing call rate, proportion of N_MISS/N_GENO

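We can check the 20 samples with the highest missingness. The F_MISS column (column 6) must be sorted numerically (`-g`); with this option the header line is sorted to the top, so it is not picked up by `tail`:

In [None]:
sort -k6,6g missingness_check.imiss | tail -n 20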

#### Removing samples with high missingness
To filter out low-quality samples, we can use the PLINK option `--mind N`. With this option we will remove all samples with a missingness higher than N.

In [None]:
plink --bfile non_imputed_set --mind 0.9 --make-bed --out non_imputed_set_mind9 
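Before filtering, it can help to preview how many samples a given threshold would remove. This is a minimal sketch on a made-up `.imiss` file (`imiss`, with invented values); `t` is a hypothetical variable holding the `--mind` threshold, and the comparison is strict because samples sitting exactly at the threshold are kept.

```shell
# Toy .imiss file with the same columns as the PLINK report (made-up values)
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
Toy.A toy1 Y 950 1000 0.95
Toy.A toy2 Y 100 1000 0.1
Toy.B toy3 Y 920 1000 0.92
Toy.B toy4 Y 900 1000 0.9
EOF

# Count samples whose missing call rate (column 6) exceeds the threshold t
n_flagged=$(awk -v t=0.9 'NR > 1 && $6 > t {n++} END {print n + 0}' "$imiss")
echo "Samples above the threshold: $n_flagged"

# List the FID and IID of those samples
awk -v t=0.9 'NR > 1 && $6 > t {print $1, $2}' "$imiss"
```

Trying a few values of `t` this way makes it easier to pick a threshold that does not discard too much of the dataset.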

### missingness_check.lmiss

In [None]:
head missingness_check.lmiss

**l**miss stands for locus missingness; the file summarizes the missing genotype rates per SNP (locus).
* CHR Chromosome number.
* SNP SNP ID.
* N_MISS Number of missing genotypes for this SNP across all individuals.
* N_GENO Total number of genotypes for this SNP across all individuals.
* F_MISS Proportion of missing genotypes for this SNP (N_MISS / N_GENO).

#### Removing SNPs with high missingness
To filter out low-quality SNPs, we can use the PLINK option `--geno N`. With this option we will remove all variants with a missingness higher than N.

In [None]:
plink --bfile non_imputed_set --geno 0.9 --make-bed --out non_imputed_set_geno9 
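It may also be informative to see where the high-missingness SNPs are located. The sketch below counts, per chromosome, the SNPs whose missing call rate exceeds a threshold. It uses a made-up `.lmiss` file (`lmiss`, with invented values), and `t` is a hypothetical variable holding the `--geno` threshold.

```shell
# Toy .lmiss file with the same columns as the PLINK report (made-up values)
lmiss=$(mktemp)
cat > "$lmiss" <<'EOF'
CHR SNP N_MISS N_GENO F_MISS
1 rs_toy1 497 523 0.95
1 rs_toy2 105 523 0.2
2 rs_toy3 476 523 0.91
2 rs_toy4 518 523 0.99
EOF

# For each chromosome (column 1), count SNPs with F_MISS (column 5) above t
flagged_per_chr=$(awk -v t=0.9 '
  NR > 1 && $5 > t { n[$1]++ }
  END { for (chr in n) print chr, n[chr] }
' "$lmiss" | sort -k1,1n)

echo "$flagged_per_chr"
```

An excess on a single chromosome, for example the Y chromosome mentioned in the PLINK warning earlier, would show up directly in this table.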

## Linkage Disequilibrium

Linkage disequilibrium refers to the non-random association of alleles at different loci. When two SNPs are in LD, they tend to be inherited together more often than would be expected by chance. \
If we do not account for LD, we may mistakenly interpret an association between two variants as causal, when in reality the association could be due to their co-inheritance.

To remove sites in LD, we can use the PLINK option `--indep-pairwise`. It takes 3 parameters: window size, step size, and r^2 threshold.

1) We select a window size in variant count (e.g. 100 selects a window of 100 SNPs; first parameter)
2) Within the window, for every pair of SNPs with an r^2 higher than our threshold (e.g. 0.1; third parameter), one SNP of the pair is removed, until no such pair remains
3) PLINK shifts the window by a number of variants (e.g. 10 shifts the window by 10 SNPs; second parameter) and repeats steps 1 and 2
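To make the r^2 threshold more concrete, here is a small worked example. For two biallelic loci with alleles A and B, r^2 is commonly defined as D^2 / (pA * (1 - pA) * pB * (1 - pB)), where D = pAB - pA * pB and pAB is the frequency of the haplotype carrying both alleles. The frequencies below are made-up values chosen for illustration. Note that `--indep-pairwise` estimates r^2 from the correlation between genotype allele counts rather than from phased haplotypes, so this is a conceptual illustration and not a reproduction of PLINK's internal calculation.

```shell
# Hypothetical frequencies: pAB = haplotype A-B, pA and pB = allele frequencies
r2=$(awk -v pAB=0.4 -v pA=0.5 -v pB=0.5 'BEGIN {
  D = pAB - pA * pB                                   # LD coefficient
  printf "%.2f", (D * D) / (pA * (1 - pA) * pB * (1 - pB))
}')

echo "r^2 = $r2"
```

With these values r^2 is 0.36, so with a threshold of 0.1 one of the two SNPs would be pruned, while with a threshold of 0.5 both would be kept.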


In [None]:
--indep-pairwise <window size>['kb'] <step size (variant ct)> <r^2 threshold>

In [None]:
plink --bfile non_imputed_set --indep-pairwise 1000 100 0.5 --out pruning_report

This command will output two files: pruning_report.prune.in and pruning_report.prune.out
* prune.in, variants that passed the pruning parameters
* prune.out, variants that did NOT pass the pruning parameters
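A quick sanity check is that the two lists together account for every variant in the input. The sketch below runs on made-up files in a scratch directory (`workdir` and `toy.bim` are hypothetical names); on the real data you would compare `pruning_report.prune.in` and `pruning_report.prune.out` against `non_imputed_set.bim`.

```shell
# Toy pruning output and matching .bim file (made-up variant IDs)
workdir=$(mktemp -d)
printf 'rs_toy1\nrs_toy3\n' > "$workdir/pruning_report.prune.in"
printf 'rs_toy2\n' > "$workdir/pruning_report.prune.out"
printf '1 rs_toy1 0 1000 A G\n1 rs_toy2 0 2000 C T\n1 rs_toy3 0 3000 G A\n' > "$workdir/toy.bim"

# Count lines in each file (awk prints a bare number, without padding)
n_in=$(awk 'END {print NR}' "$workdir/pruning_report.prune.in")
n_out=$(awk 'END {print NR}' "$workdir/pruning_report.prune.out")
n_total=$(awk 'END {print NR}' "$workdir/toy.bim")

# Kept plus removed variants should equal the number of variants in the .bim
if [ $((n_in + n_out)) -eq "$n_total" ]; then
  status="consistent"
else
  status="mismatch"
fi

echo "kept: $n_in, removed: $n_out, total: $n_total ($status)"
```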

We can now subset the non_imputed_set dataset based on the pruning_report files, by running **one** of these two lines of code. Both produce the same pruned dataset.

In [None]:
plink --bfile non_imputed_set --extract pruning_report.prune.in --make-bed --out non_imputed_set_pruned 
plink --bfile non_imputed_set --exclude pruning_report.prune.out --make-bed --out non_imputed_set_pruned 

# Visualizing the genetic variability with PCA

We are providing you with a script that performs the following steps:
1) Convert vcf to plink format
2) Apply maf and LD filters
3) Convert plink file to eigenstrat format
4) Run smartpca

In [None]:
python ../FinalScripts/VCF2smartpca.py \
--name my_pca_script \
--input_file non_imputed_set \
--maf 0.05 \
--pruning '1000 100 0.5' \
--pca_project YES \
--pca_controls Albanian.HO,Basque.HO,Belarusian.HO,Bulgarian.HO,Croatian.HO,Cypriot.HO,Czech.HO,\
English.HO,Estonian.HO,Finnish.HO,French.HO,Greek.HO,Hungarian.HO,Icelandic.HO,\
Italian_North.HO,Italian_Sardinian.HO,Lithuanian.HO,Maltese.HO,Norwegian.HO,Orcadian.HO,\
Romanian.HO,Russian.HO,Sardinian.HO,Scottish.HO,Sicilian.HO,Spanish.HO,Spanish_North.HO,Ukrainian.HO

The script will create a bash file, "my_pca_script.sh", with all the command lines needed. We can inspect it with `more` and then submit it to the job scheduler with `sbatch`:

In [None]:
more my_pca_script.sh

In [None]:
sbatch my_pca_script.sh

Once the job has finished running, you will find two files in your directory:
* **pca.evec**, containing the coordinates of each individual along each principal component (eigenvectors).
* **eval**, containing the eigenvalues, indicating how much variance each principal component explains.

#### Calculate the variance explained for each PC

In [None]:
awk '{sum += $1} END {print sum}' non_imputed_set_maf0.05_pruned.eval

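Let's say it returns 98. We can then divide each eigenvalue by this total to obtain the percentage of variance explained by each PC: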

In [None]:
cat non_imputed_set_maf0.05_pruned.eval | while read line; 
do explained_variance=$(echo "scale=4; $line / 98 * 100" | bc); 
echo "Explained variance: $explained_variance%" >> explained_variance_output.txt; done
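The two steps above can also be combined so that the total does not have to be typed in by hand. This is a sketch that reads the file twice with `awk`: the first pass sums the eigenvalues, the second prints each one as a percentage of that sum. It runs on a made-up eigenvalue file (`eval_file`, with invented values); on the real output you would pass `non_imputed_set_maf0.05_pruned.eval` twice instead.

```shell
# Toy eigenvalue file, one eigenvalue per line (made-up values)
eval_file=$(mktemp)
printf '5.0\n3.0\n2.0\n' > "$eval_file"

# First pass (NR == FNR): accumulate the total.
# Second pass: print each eigenvalue as a percentage of the total.
explained=$(awk '
  NR == FNR { sum += $1; next }
  { printf "PC%d\t%.2f%%\n", FNR, $1 / sum * 100 }
' "$eval_file" "$eval_file")

echo "$explained"
```

The result can be redirected to a file, for example `explained_variance_output.txt`, in the same way as in the loop above.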

#### You can plot the PCA using the PCA_plotly.py script
**Install plotly and pandas first.**

In [None]:
python PCA_plotly.py \
--input_file non_imputed_set_maf0.05_pruned.pca.evec \
--output_file output