Alzheimer classification

The aim of this project is to classify patients as healthy or ill based on their genetic information.

Versions

GNU bash 4.4.19(1)

GATK 4.0.10.1 (https://software.broadinstitute.org/gatk/)

Python 3.6.6

boruta 0.1.5
numpy 1.15.4
pandas 0.23.4
scikit-learn 0.20.1

How to use

Processing vcf files:

prepreparing.sh
make_pid-diagnoses.py
makeY.py
makeX_pooling.sh

Selection of attributes and classification:

boruta_classification.py

Steps of an example analysis

The steps below use the prepared testing data (400 patients, ~38k SNPs) from the "./testing/files" directory.

prepare csv matrices from vcf files

./prepreparing.sh -all -tar -gz -stats -matrix -base test -vcf test_chr_SNPs.vcf -dir ${PWD}/testing/

prepare file with diagnoses

python make_pid-diagnoses.py -dir ${PWD}/testing/ -diagdir ${PWD}/testing/diagnoses/ -dataset test

build Y vectors (containing diagnosis for each patient)

python makeY.py -dir ${PWD}/testing/

build X matrices

./makeX_pooling.sh -all -dir ${PWD}/testing/

run boruta in the correct way (train/test split before selection of important SNPs)

python boruta_classification.py -boruta -dataset test ${PWD}/testing/ -borutarun 1 -test 0.1

run boruta in the wrong way (no train/test split before selection of important SNPs)

python boruta_classification.py -boruta -dataset test ${PWD}/testing/ -borutarun 2

build a Random Forest classifier from the SNPs selected in the first run of boruta and perform the classification

python boruta_classification.py -class -dataset test ${PWD}/testing/ -borutarun 1 -classrun 1

build a Random Forest classifier from the SNPs selected in the second run of boruta and perform the classification

python boruta_classification.py -class -dataset test ${PWD}/testing/ -borutarun 2 -classrun 2 -test 0.1

Results of the correct classification procedure are saved in "./testing/dataset/boruta/class_results_1.txt". Results of the wrong classification procedure are saved in "./testing/dataset/boruta/class_results_2.txt".
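
The split has to come before SNP selection because selecting SNPs on all patients lets information about the test patients leak into the model. The sketch below illustrates this on pure-noise genotypes using only the Python standard library. It is not the project's code: the class-mean-gap score and the nearest-centroid classifier are simple stand-ins for Boruta and Random Forest, and every name and parameter in it is hypothetical.

```python
import random

def class_mean_gap(column, labels):
    """Absolute difference between the class-1 and class-0 means of one feature."""
    ones = [v for v, y in zip(column, labels) if y == 1]
    zeros = [v for v, y in zip(column, labels) if y == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

def select_top(X, y, rows, k):
    """Indices of the k features with the largest class-mean gap on `rows`."""
    labels = [y[r] for r in rows]
    scored = [(class_mean_gap([X[r][j] for r in rows], labels), j)
              for j in range(len(X[0]))]
    return [j for _, j in sorted(scored, reverse=True)[:k]]

def centroid_accuracy(X, y, train, test, features):
    """Fit per-class centroids on `train`, return accuracy on `test`."""
    centroids = {}
    for c in (0, 1):
        members = [r for r in train if y[r] == c]
        centroids[c] = [sum(X[r][j] for r in members) / len(members)
                        for j in features]
    correct = 0
    for r in test:
        dist = {c: sum((X[r][j] - m) ** 2 for j, m in zip(features, centroids[c]))
                for c in (0, 1)}
        correct += min(dist, key=dist.get) == y[r]
    return correct / len(test)

def run(select_on_all, seed, n=60, p=500, k=10, test_fraction=0.25):
    """Simulate one analysis on noise genotypes and return the test accuracy."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]          # balanced labels, unrelated to X
    rows = list(range(n))
    rng.shuffle(rows)
    n_test = int(n * test_fraction)
    test, train = rows[:n_test], rows[n_test:]
    # The two procedures differ only in which rows drive feature selection.
    features = select_top(X, y, rows if select_on_all else train, k)
    return centroid_accuracy(X, y, train, test, features)

seeds = range(10)
wrong = sum(run(True, s) for s in seeds) / len(seeds)
correct = sum(run(False, s) for s in seeds) / len(seeds)
print(f"selection on all patients:  {wrong:.2f}")
print(f"selection on training only: {correct:.2f}")
```

Because the labels are unrelated to the genotypes, an honest procedure should score close to chance. Selecting on all patients typically reports a clearly higher accuracy, which is an artifact of the leak and not a real signal.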

Description

This project started as part of my Bachelor's thesis, "Classification of patients with Alzheimer's disease based on DNA polymorphisms". It later grew into a larger project on classifying patients using WGS and GWAS data from the ADNI and Rosmap consortia.

Used data

Three sets of data have been used:

Whole genome sequencing (WGS) data of 486 patients (235 cases, 251 controls), obtained from the ADNI consortium (https://adni.loni.usc.edu/).
WGS data of 1033 patients (530 cases, 503 controls), obtained from the Rosmap project (https://www.synapse.org/#!Synapse:syn10901595).
Data from a genome-wide association study (GWAS) of 432 patients, obtained from the ADNI consortium.

Basic steps of analysis

The basic analysis of each data set consists of the following steps:

Rewriting genetic data into matrices, and information about patients and diagnoses into text files.
Dividing patients into training and testing sets.
Selecting the most important SNPs with the Boruta algorithm, using only the training set.
Training a Random Forest classifier on the selected SNPs.
Testing the classifier on the testing set.
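
The first step above, rewriting genotypes into a matrix, can be sketched as follows. This is an illustration only and not the project's own conversion code. The encoding (number of alternate alleles, -1 for a missing call), the fallback SNP name and the function names are assumptions, and the sketch expects GT to be the first FORMAT sub-field, as the VCF specification requires when GT is present.

```python
def encode_genotype(sample_field):
    """Return the number of alternate alleles in a sample field, or -1 if missing."""
    gt = sample_field.split(":")[0]              # GT is the first sub-field
    alleles = gt.replace("|", "/").split("/")    # treat phased and unphased alike
    if any(a == "." for a in alleles):
        return -1
    return sum(1 for a in alleles if a != "0")

def vcf_to_matrix(lines):
    """Parse VCF text lines into (sample_ids, snp_ids, matrix).

    matrix[i][j] is the encoded genotype of sample i at SNP j.
    """
    samples, snps, columns = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                              # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                  # sample IDs follow FORMAT
            continue
        # Use the ID column, or chrom:pos when the ID is missing (assumption).
        snps.append(fields[2] if fields[2] != "." else f"{fields[0]}:{fields[1]}")
        columns.append([encode_genotype(f) for f in fields[9:]])
    matrix = [list(row) for row in zip(*columns)]  # transpose to patients x SNPs
    return samples, snps, matrix

example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tP1\tP2\tP3",
    "1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "1\t2000\t.\tC\tT\t.\tPASS\t.\tGT:DP\t./.:0\t1|1:12\t0|1:9",
]
samples, snps, matrix = vcf_to_matrix(example)
print(samples, snps, matrix)
```

Each row of the resulting matrix is one patient and each column one SNP, which is the layout that the later selection and classification steps expect in this sketch.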

Boruta algorithm

The Boruta algorithm was developed by Miron B. Kursa and Witold R. Rudnicki ("Feature Selection with the Boruta Package", Journal of Statistical Software, Vol. 36 (2010), https://www.jstatsoft.org/article/view/v036i11). This project uses the Python implementation of Boruta by Daniel Homola (http://danielhomola.com/2015/05/08/borutapy-an-all-relevant-feature-selection-method/).

Boruta is a feature selection method designed as a wrapper around the Random Forest classification algorithm. It iteratively removes features that a statistical test shows to be less relevant than random probes.
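
The comparison against random probes can be illustrated with a short sketch. In each iteration every feature gets a shuffled "shadow" copy, and a feature scores a hit when its importance exceeds the best shadow importance. The sketch is deliberately simplified and is not BorutaPy: a class-mean-gap score stands in for Random Forest importance, features are not removed between iterations, and the statistical test on the hit counts is left out. All names are hypothetical.

```python
import random

def importance(column, labels):
    """Stand-in importance: absolute difference between the two class means."""
    ones = [v for v, y in zip(column, labels) if y == 1]
    zeros = [v for v, y in zip(column, labels) if y == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

def shadow_hits(X, y, n_iter=50, seed=0):
    """Count, per feature, how often it beats the best shuffled shadow feature."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*X)]     # one list per feature
    real = [importance(col, y) for col in columns]
    hits = [0] * len(columns)
    for _ in range(n_iter):
        shadows = []
        for col in columns:
            shadow = col[:]
            rng.shuffle(shadow)                  # breaks any link with the labels
            shadows.append(shadow)
        threshold = max(importance(s, y) for s in shadows)
        for j, value in enumerate(real):
            if value > threshold:
                hits[j] += 1
    return hits

# Toy data: feature 0 follows the label exactly, features 1 and 2 take the
# same values in both classes and therefore carry no information.
y = [i % 2 for i in range(40)]
X = [[2 * (i % 2), (i // 2) % 3, (i // 4) % 3] for i in range(40)]
hits = shadow_hits(X, y)
print(hits)
```

On this toy data the informative feature should collect a hit in every iteration, while the two uninformative features should collect none. The full algorithm turns such hit counts into accept or reject decisions with a statistical test.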

Additional analysis

analysis of subset of SNPs (e.g. SNPs shared between two data sets)
removing outlier patients
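
Restricting two data sets to their shared SNPs can be sketched as below. This is a minimal illustration and not the project's code: it assumes that SNP identifiers are unique within each data set and directly comparable between them, and the function name is hypothetical.

```python
def restrict_to_shared(snps_a, matrix_a, snps_b, matrix_b):
    """Keep only the SNP columns present in both data sets, in snps_a order."""
    index_b = {snp: j for j, snp in enumerate(snps_b)}
    shared = [snp for snp in snps_a if snp in index_b]
    cols_a = [j for j, snp in enumerate(snps_a) if snp in index_b]
    cols_b = [index_b[snp] for snp in shared]
    sub_a = [[row[j] for j in cols_a] for row in matrix_a]
    sub_b = [[row[j] for j in cols_b] for row in matrix_b]
    return shared, sub_a, sub_b

snps_a = ["rs1", "rs2", "rs3"]
matrix_a = [[0, 1, 2], [2, 1, 0]]      # two patients, three SNPs
snps_b = ["rs3", "rs9", "rs1"]
matrix_b = [[1, 1, 0]]                 # one patient, three SNPs
shared, sub_a, sub_b = restrict_to_shared(snps_a, matrix_a, snps_b, matrix_b)
print(shared, sub_a, sub_b)
```

After this step both matrices have the same columns in the same order, so a classifier trained on one data set can be applied to the other.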

