The software for the Institute's genotyping-array-based blood group typing

ikmb/BloodTypingArray

BloodTypingArray

A software package for array-based molecular blood group typing.
The software infers blood group genotypes measured on an Illumina custom array from the nucleotide to the phenotypic blood group level. Input files are FinalReport files exported from Illumina GenomeStudio. BloodTypingArray determines the blood group alleles either directly by using a SNP-to-blood genotype dictionary or indirectly based on a machine learning approach using TensorFlow [REF1].

There are two ways to run the software. The most convenient way is to use a conda environment with all dependencies resolved. To do so, you must have conda installed; then copy and paste the following commands into a terminal. If your OS is already set up for development and you do not want to use our preconfigured conda environment, you can instead follow the step-by-step installation below.

# Clone the repository
git clone git@github.com:ikmb/BloodTypingArray.git
# Create conda environment
cd BloodTypingArray/
# The following command creates an environment called bloodtypingarray-1.0
conda env create -f environment.yml
conda activate bloodtypingarray-1.0
# Create executables
./make_executables.sh

You can now jump to the test section of this README.

Step-by-Step installation

Compiling the source code

All executables generated in the following steps are usually located in a sub-folder dist/Release/GNU-Linux/. The last part (GNU-Linux) may differ on your OS, so adapt accordingly. Lines 21 and 22 of the file ./DeepBloodArray/classifyFinalReports.py may also need to be adapted.

  1. Install the required tools
sudo apt-get install build-essential

Please make sure you also have the static versions of glibc and libstdc++ installed. Which packages you must install depends on your OS:

# CentOS
sudo yum install glibc-static libstdc++-static -y
# Ubuntu
sudo apt-get install libc6-dev
# ...
  2. Build MyTools
    MyTools is a collection of useful C++ classes/methods and must be compiled before the other C++ projects.
cd MyTools/
make all
cd ..

The library can be found under: dist/Release/GNU-Linux/

  3. Build bloodArray
    bloodArray reads FinalReport files, generated by an Illumina GenomeStudio export, and returns a tab-delimited table with the inferred blood group alleles. All output goes to stdout; the log goes to stderr. This is the experimental direct caller part of the software package. If you are interested in this part of the software, you should take the logic from the phenotype() functions and reimplement it in a language of your choice.
cd bloodArray/
make all
cd ..

The executable can be found under: dist/Release/GNU-Linux/
Run it like this:
bloodArray FinalReport1.txt [FinalReport2.txt ... FinalReportN.txt]

  4. Build FinalReportToEvoker
    FinalReportToEvoker generates Evoker file(s) from FinalReport files.
cd FinalReportToEvoker/
make all
cd ..

The executable can be found under: dist/Release/GNU-Linux/

The Evoker file format usually consists of binary plink [REF2] files plus an extra file with the Allele-AB intensities in binary format. We need these intensities for the TensorFlow classifier. With FinalReportToEvoker we generate a fam file (sample annotation), a bim file (SNP annotation) and a bnt file (the Allele-AB intensities). We do not generate the bed file (genotypes in binary plink format) as we do not need the genotypes.
NOTE: FinalReportToEvoker does not validate its parameters. If something goes wrong, it crashes without a meaningful message. Please run it like this:
FinalReportToEvoker consider_those.csv OUTPUTFAMFILE OUTPUTBIMFILE OUTPUTBNTFILE FINALREPORTFILE1 [FINALREPORTFILE2 ... FINALREPORTFILEN]

consider_those.csv is a text file with two comma separated columns. One column with the antigen/allele and the other column with the required probe_set_ids. This file can be found in the folder DeepBloodArray.
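
Since FinalReportToEvoker does not check its parameters, it can help to check them before the call. The following is a minimal sketch of such a pre-flight check, not part of the repository: the helper name build_evoker_command and the checks it performs are hypothetical, and the default binary path is the one used in the test section of this README (adapt GNU-Linux to your OS).

```python
from pathlib import Path

# Path as used in the test section of this README; adapt to your build.
EVOKER_BIN = "FinalReportToEvoker/dist/Release/GNU-Linux/finalreporttoevoker"


def build_evoker_command(consider_csv, fam, bim, bnt, final_reports,
                         binary=EVOKER_BIN):
    """Return the argument list for FinalReportToEvoker after basic checks.

    Hypothetical helper: these checks are our own, not part of the tool.
    """
    if not final_reports:
        raise ValueError("at least one FinalReport file is required")
    # Every input file must exist before the tool is started.
    for path in [consider_csv, *final_reports]:
        if not Path(path).is_file():
            raise FileNotFoundError(f"input file not found: {path}")
    # The output directories must exist so the output files can be created.
    for path in (fam, bim, bnt):
        if not Path(path).resolve().parent.is_dir():
            raise FileNotFoundError(f"output directory missing for: {path}")
    # Argument order as documented: csv, fam, bim, bnt, then the reports.
    return [binary, str(consider_csv), str(fam), str(bim), str(bnt),
            *[str(p) for p in final_reports]]
```

The returned list can be passed to subprocess.run(cmd, check=True), so that a crash of the tool surfaces as a Python exception.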

DeepBloodArray

DeepBloodArray is a Python project and contains Python script files that are used to infer blood group alleles for the blood groups Rh and MNS. It needs an appropriate environment and is mainly based on the following two script files:

  1. trainAndEvaluate.py trains a new classifier
  2. classifyFinalReports.py takes a final report as input and returns a JSON file with the blood group alleles

The folder Models contains pre-trained models.

Create the Conda environment

  1. Install miniconda3
  2. Set up the environment:
conda create -n MyEnvName
conda activate MyEnvName
conda install -c conda-forge tensorflow scikit-learn pandas -y

Deep learning background

The neural network was constructed using TensorFlow's Keras API. The architecture consists of three stacked layers: an input layer with 9 neurons, a hidden layer with 6 neurons, and an output layer with one neuron. ReLU is used as the activation function throughout the network, except for the output neuron, which uses a sigmoid activation function that produces values between zero and one. The input shape is determined by the number of SNPs used for training and therefore varies between antigens. The model was trained for 100 epochs with RMSprop as the optimizer and binary cross-entropy as the loss function. Because the problem and the objective function are framed as a binary classification, the model's predictions can be interpreted as class probabilities.
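
To illustrate the data flow through such a network, here is a minimal sketch of the forward pass and the loss in plain Python. It is not the code in trainAndEvaluate.py, which uses Keras. It assumes that "input layer with 9 neurons" means a first dense layer with 9 units fed by n_features inputs. The weights are random placeholders rather than a trained model, and n_features = 4 is an arbitrary choice.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (placeholder, untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(features, layers):
    """Forward pass: ReLU on all layers except the final sigmoid neuron."""
    h = features
    for weights, biases in layers[:-1]:
        h = dense(h, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(h, weights, biases, sigmoid)[0]


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Loss for one sample with label y_true (0 or 1) and predicted p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


rng = random.Random(0)
n_features = 4  # arbitrary; the real value depends on the SNPs per antigen
layers = [init_layer(n_features, 9, rng),
          init_layer(9, 6, rng),
          init_layer(6, 1, rng)]
probability = predict([0.2, 0.8, 0.5, 0.1], layers)
loss = binary_cross_entropy(1, probability)
```

The sigmoid output always lies strictly between zero and one, which is what allows it to be read as the probability of the positive class.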

Test

Direct typing using the executable bloodarray

cd BloodTypingArray
bloodArray/dist/Release/GNU-Linux/bloodarray DeepBloodArray/test/FinalReport1.txt

returns:

Sample_ID	filename	ABO	Rh	Lutheran	Kell	Duffy	Kidd	Diego	Yt	Scianna	Dombrock	Colton	Landsteiner-Wiener	CROM	Knops	JR	LAN	Vel	Indian	MNS	Rh
Sample_ID	filename	ABO	RH	LU	KEL	FY	JK	DI	YT	SC	DO	CO	LW	CROM	KN	JR	LAN	VEL	Indian	MNS	RH
Sample_ID	filename	ABO	RHD	BCAM	KEL	ACKR1	SLC14A1	SLC4A1	ACHE	ERMAP	ART4	AQP1	ICAM4	CD55	CR1	ABCG2	ABCB6	SMIM1	CD44	GYPA,GYPB	RHCE
Sample_ID	filename	001	004	005	006	008	009	010	011	013	014	015	016	021	022	032	033	034	023	002	004
pseudoID	FinalReport1.txt	A	D.	Lu(a-b+),Au(a-b+),Lu8+Lu14-	kk,Kp(a-b+),Js(a-b+)	Fy(a-b+)	Jk(a-b+)	Di(a-b+),Wr(a-b+)	Yt(a+b-)	Sc1+Sc2-	Do(a+b+),Hy+,Jo+	Co(a+b-),	LW(a+b-)	Cr(a+),Tc(a+b-c-)	Kn(a+b-),McC(a+b-),Vil-	Jr(a+)	Lan+	Vel+	#N/A
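
For downstream processing, the table can be read back into Python. The sketch below is a hypothetical helper, not part of the repository. It assumes, based only on the example output above, that the first four lines are header rows (system name, system symbol, gene, ISBT number) and that every following line describes one sample.

```python
def parse_bloodarray_output(text, n_header_rows=4):
    """Split bloodArray stdout into header rows and per-sample records.

    Assumes n_header_rows header lines followed by one line per sample.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    headers, samples = rows[:n_header_rows], rows[n_header_rows:]
    symbols = headers[1]  # second header row: blood group system symbols
    results = []
    for row in samples:
        # Pair calls with symbols by position, so the two RH columns stay
        # distinct; zip stops at the shorter row if trailing calls are missing.
        calls = list(zip(symbols[2:], row[2:]))
        results.append({"sample_id": row[0],
                        "filename": row[1],
                        "calls": calls})
    return headers, results
```

The calls are kept as a list of (symbol, call) pairs rather than a dictionary because the symbol RH appears twice in the header (once for RHD and once for RHCE).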

Generating evoker files from FinalReports

FinalReportToEvoker/dist/Release/GNU-Linux/finalreporttoevoker DeepBloodArray/consider_those.csv out.fam out.bim out.bnt DeepBloodArray/test/FinalReport1.txt

This should generate the three output files out.fam, out.bim and out.bnt.
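
A quick sanity check of the two text outputs could look like the sketch below. It assumes that FinalReportToEvoker writes the standard whitespace-delimited plink layouts, in which the second column of a fam file is the sample ID and the second column of a bim file is the variant ID. The function names are hypothetical, and the binary bnt file is not inspected.

```python
def read_column(path, index):
    """Return one whitespace-delimited column from a text file."""
    with open(path) as handle:
        return [line.split()[index] for line in handle if line.strip()]


def summarize_evoker_output(fam_path, bim_path):
    """Count samples and SNPs, assuming standard plink fam/bim columns."""
    samples = read_column(fam_path, 1)  # fam column 2: within-family ID
    snps = read_column(bim_path, 1)     # bim column 2: variant identifier
    return {"n_samples": len(samples), "n_snps": len(snps),
            "samples": samples, "snps": snps}
```

For the test call above, summarize_evoker_output("out.fam", "out.bim") returns the sample and SNP lists that the classifier will see.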

Molecular blood group typing, main script.

Uses the two executables tested before and runs the classifier. Finally, it generates a json output.

# If your conda environment is not activated, please activate it with
conda activate MyEnvName

Run the script:

python3 DeepBloodArray/classifyFinalReports.py

This should create the result file data_sampleID.json.

Test the training function of the classifier

python3 DeepBloodArray/trainAndEvaluate.py --output /your/output/directory

This should run the training of the classifiers for the different antigens ['c','C','e','E','M','N','s','S'].
The output directory will contain the trained models (*.mdl) and several plots: an overview plot of the score distribution of the validation samples (20% of all samples) and, for every SNP, a scatter plot of the allele_AB intensities, color-coded by the group to which each sample belongs. The training data in this repository is restricted to HGDP individuals. To reproduce the published work, you need to request the German cohort data, which are subject to controlled access, from the PopGen 2.0 Network (P2N) biobank (access token: P2N_859BH) and add them before running the training (see next section). The trained models provided in this repository were trained with HGDP and P2N samples.

Data

The raw data are stored in rawData_HGDP_only.zip and this archive contains 334 FinalReport files (exported raw data in text format). These are HGDP samples only. To receive the North German samples, please send a request to "PopGen 2.0 Netzwerk (P2N)", transfer@p2n-sh.de quoting the access token P2N_859BH.

References

  1. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga R, Moore S, Murray DG, Steiner B, Tucker P, Vasudevan V, Warden P, Wicke M, Yu Y, Zheng X. TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016. pp. 265-283.
  2. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007 Sep;81(3):559-75. doi: 10.1086/519795. Epub 2007 Jul 25. PMID: 17701901; PMCID: PMC1950838.
