This introduction is divided into three sections:
- the contents sent by email to the user when a full analysis-run finishes,
- the use and setup of the analysis modules,
- the algorithm components behind the scenes.
Together, these three aspects make up the functionality of impute.me.
Part 1: Download descriptions
On each completed analysis-run an email is sent with the user's uniqueID. This uniqueID can be used to browse the modules described in part 2, but in addition three links are provided for direct download:
Simple-format imputed output. This format is also sometimes called 23andme-format because it was standardized by the company 23andme. It contains rows of all SNPs with rs-ID, chromosome, position, and genotype. When given as output from the impute.me-site, the only difference is that a lot more SNPs are available. This is a consequence of the imputation. The file is zipped and divided by chromosome.
Gen-format files contain similar information to the 23andme-format files, i.e. all your imputed SNPs, but in a more fully descriptive format that also reflects the statistical uncertainty of genotype calls. This format is called the gen-format. The key to interpretation is that the three right-most columns contain the probabilities of the three possible genotypes. This means that a value of A C 1 0 0 is almost sure to be A/A, whereas A C 0.7 0.3 0 is most likely A/A but with some probability of being A/C.
JSON-format files contain calculated phenotypical data. This file contains the output of all the genetic calculators that a sample is subjected to at impute.me. The JSON-format is made to be computer-readable, but many good online viewers exist, so you can easily inspect the data yourself. However, since the same information is given in the browsable modules described in the next section, the only purpose of providing this format is long-term storage and data interchange.
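Returning to the gen-format described above, the following Python sketch illustrates how the three probability columns can be read. It assumes the common single-sample layout of five leading columns (SNP ID, rs-ID, position, allele A, allele B) followed by the probabilities of A/A, A/B and B/B. The function name and the example lines are hypothetical and are not taken from the impute.me code.

```python
def most_probable_genotype(gen_line):
    """Return (rsid, genotype, probability) for one single-sample gen-format line.

    ASSUMED column layout (not confirmed by the text above):
    snp_id rsid position allele_a allele_b p_aa p_ab p_bb
    """
    fields = gen_line.split()
    rsid, allele_a, allele_b = fields[1], fields[3], fields[4]
    probs = [float(x) for x in fields[5:8]]
    genotypes = [
        f"{allele_a}/{allele_a}",
        f"{allele_a}/{allele_b}",
        f"{allele_b}/{allele_b}",
    ]
    # Pick the genotype with the highest probability
    best = max(range(3), key=lambda i: probs[i])
    return rsid, genotypes[best], probs[best]


# Hypothetical example lines using the values discussed above
certain = most_probable_genotype("--- rs0000001 12345 A C 1 0 0")
uncertain = most_probable_genotype("--- rs0000002 67890 A C 0.7 0.3 0")
print(certain)
print(uncertain)
```

Both example lines resolve to A/A, but the returned probability (1.0 versus 0.7) shows how much confidence the call deserves.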
Part 2: Module design description
The function of each specific module is documented by its UI-provided description. The general setup requires the input of a uniqueID, which links to the user's data. For many of the modules, the calculations are trivial, for example reporting the presence or absence of a specific genotype. For others, we rely heavily on polygenic risk scores. Three approaches to polygenic risk scores are implemented in the function get_GRS_2. Since it is an over-arching concept, the different types of risk score calculations are explained here:
Basic count score. This simply counts the effect alleles and is the simplest form of polygenic risk score. It is intuitive to understand: the more risk alleles, the higher the score. The main drawback is that it doesn't distinguish between SNPs with large and small effects.
Count-score = Σ Effect-allele-count_snp (I)
Weighted-score. A score that is weighted by the effect size of each SNP, so that SNPs with large effect sizes count more than SNPs with small effect sizes. Note that beta is replaced by log(OR) for binary traits. The only drawback of this score type is that it is on an arbitrary scale and does little to inform about risk compared to the rest of the population.
Weighted-score = Σ Beta_snp * Effect-allele-count_snp (II)
Z-score. A score that is given in standard deviations above or below the average risk score for that population. This specific implementation of the Z-score is found here. The frequency_snp is obtained from 1000 genomes data for the relevant super-population. Effect-allele-count and Beta are used as in the previous scores. The Standard-deviation_population is calculated according to this code. In many of the modules an extra step is added where the Z-score is converted to the percentage of the population with a lower score. This is done with the standard pnorm function, i.e. we assume that the scores are normally distributed. To check the validity of this assumption, some modules have an option to compare to real distributions.
Population-score_snp = frequency_snp * 2 * Beta_snp (III)
Zero-centered-score = Σ (Beta_snp * Effect-allele-count_snp - Population-score_snp) (IV)
Z-score = Zero-centered-score / Standard-deviation_population (V)
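The three score types above can be sketched in a few lines of Python. This is an illustration of equations (I) to (V), not the get_GRS_2 implementation. In particular, the population standard deviation is an assumption here: it is derived from allele frequencies as sqrt(Σ Beta² * 2 * f * (1 - f)), which presumes independent SNPs in Hardy-Weinberg equilibrium. All function names and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def count_score(counts):
    """Equation (I): sum of effect-allele counts (0, 1 or 2 per SNP)."""
    return sum(counts)


def weighted_score(betas, counts):
    """Equation (II): effect-allele counts weighted by each SNP's beta."""
    return sum(b * c for b, c in zip(betas, counts))


def z_score(betas, counts, frequencies):
    """Equations (III)-(V) for one individual.

    ASSUMPTION: the standard deviation below is a stand-in formula
    (independent SNPs, Hardy-Weinberg equilibrium); the text above only
    refers to the actual code without stating the formula.
    """
    zero_centered = 0.0
    variance = 0.0
    for beta, count, freq in zip(betas, counts, frequencies):
        population_score = freq * 2 * beta                # (III)
        zero_centered += beta * count - population_score  # (IV)
        variance += beta ** 2 * 2 * freq * (1 - freq)     # assumed SD model
    return zero_centered / sqrt(variance)                 # (V)


def percent_with_lower_score(z):
    """Normal-CDF conversion, the Python counterpart of R's pnorm."""
    return 100 * NormalDist().cdf(z)


# Hypothetical three-SNP example
betas = [0.5, 0.25, 1.0]
counts = [2, 1, 0]
frequencies = [0.5, 0.5, 0.5]

print(count_score(counts))
print(weighted_score(betas, counts))
z = z_score(betas, counts, frequencies)
print(z, percent_with_lower_score(z))
```

A person carrying exactly the population-average number of effect alleles at every SNP gets a zero-centered score of 0, hence a Z-score of 0 and a percentile of 50. For binary traits, the betas would be log odds ratios, e.g. math.log(odds_ratio).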
These scores are used extensively, as described in each module. For further understanding of each module, refer to the source code of that module. This pdf-file contains further illustrated examples of PRS derivation. The set of input SNPs is intended to be as complete as possible, e.g. using LDpred info. More often, however, it is based on the top-significant (pruned) hits out of necessity. This is also indicated on a per-module basis.
When reading the code for each module, the main parts are the ui.R and server.R files. The details of this setup can be found in the R/Shiny documentation. Shiny is the R web framework that has been used to create these modules. A template module which contains the very minimal configuration is found in the 'template' folder. Additional module-specific files that are often used are the SNPs_to_analyze.txt file and the export_script.R, further described elsewhere.
Part 3: Imputation algorithm description
After upload of data, a central aspect of the analysis engine is the imputation process. The purpose of this process is to 'fill in the blanks', meaning to provide educated guesses of missing SNPs. Imputation lets the analysis modules connect to the latest research even when it concerns SNPs that were not directly measured. Unlike the two previous sections, it is not at all necessary to understand these aspects of the impute.me algorithm. They are provided here for reference:
(1) prepare_individual_genome (a function) is triggered at data-submission time (not cron-job dependent). This function simply unpacks the data, copies it to the designated imputation area (~/imputations), performs a few data consistency checks and assigns a unique ID to each submission.
(2) run_imputation (a function) is triggered by the cron-job checking for ready data in the imputation area (~/imputations). It consists of a series of calls to bash-based programs. First a shapeit call is made to phase the data correctly. Note that there is quite a lot of extra work involved in avoiding errors from single-sample-homozygote problems and such (up until the cmd4 call). After the shapeit calls, a call to impute2 is made. The end-product of this function is a set of per-chromosome gen-files entitled "step_7_chr_xx", because they come from the seventh step of calculation.
(3) summarize_imputation (a function) is run immediately after run_imputation, in the cron-job. The goal of this function is to organize and summarize the per-chromosome gen-files, saving them as 1) a gen-file with probability-estimate-containing imputation data, and 2) a 'simple'-format file in which the calls are already made. The current calling threshold is 0.9, per the gtools default setting (but it is variable per use-case - PRS for example may be robust to lower thresholds). The key to this function is the extensive splitting that minimizes the memory footprint. Without it, memory use could become very large, particularly for the long chromosomes, and would prevent cheap(er) hub-to-node running.
(4) imputation_cron_job.R (a script file). This file is called as a cron-job. It checks what input-data is ready and then calls the two functions described above (2 and 3), as well as the per-module export scripts (7). The extra code in the file is used to determine if it should run as a hub-job or a node-job; node jobs are preferred, because these functions are computationally expensive. In a hub-only setup, calculations are run on the same computer as the web-site interface. In the hub+node-setup, a central hub computer stores data and shows the web-site, while a scalable number of node-computers perform all computationally heavy calculations.
(5) bulk_imputation_cron_job.R (a script file). Similar to the imputation_cron_job.R (4) file, this calls the cron-job driven functions for imputation and summary. The difference with this mode is that it processes in multiples of ten samples. This economises use of computer-power, at the cost of taking more time from data-submission to finished results.
(6) run_bulk_imputation (a function). Similar to run_imputation (2), but called from the bulk_imputation_cron_job.R script (5). It is, however, followed by the regular summarize_imputation (3) because there is no CPU-saving potential in merging that aspect.
(7) export_script.R (several scripts, e.g. here and here). For each module, this script contains computationally intensive calculations used to pre-calculate derived data. For details on each script, refer to the code in the module as well as part 2. The scripts are activated using the run_export_script function (e.g. here).
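To make step (3) more concrete, the conversion from gen-format to 'simple'-format can be sketched as a line-by-line stream, which keeps memory use flat regardless of chromosome length. This Python sketch is not the summarize_imputation code. It assumes a single-sample gen layout (SNP ID, rs-ID, position, allele A, allele B, three probabilities), a tab-separated output row, and that calls whose best probability is below the threshold are simply dropped. The function name and example data are hypothetical.

```python
import io


def gen_to_simple(gen_lines, chromosome, threshold=0.9):
    """Yield simple-format rows: rsid, chromosome, position, genotype.

    Lines are consumed one at a time, so only one row is held in memory.
    ASSUMPTION: genotypes below the threshold are skipped; the real code
    may instead write a no-call.
    """
    for line in gen_lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        rsid, position, a, b = fields[1], fields[2], fields[3], fields[4]
        probs = [float(x) for x in fields[5:8]]
        best = max(range(3), key=lambda i: probs[i])
        if probs[best] < threshold:
            continue  # not confident enough to make a call
        genotype = (a + a, a + b, b + b)[best]
        yield "\t".join([rsid, chromosome, position, genotype])


# Hypothetical two-line gen file for chromosome 1
example = io.StringIO(
    "--- rs0000001 12345 A C 1 0 0\n"
    "--- rs0000002 67890 A C 0.7 0.3 0\n"
)
rows = list(gen_to_simple(example, chromosome="1"))
print(rows)
```

With the default threshold of 0.9 only the first SNP is called. Lowering the threshold, as might be acceptable for PRS use, would also keep the second.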
- email and filename as linked to genetic data (non-linked email is saved elsewhere per terms-of-service)
Remaining data consist of all the derived data, such as disease-gene-scores. Although this includes some genotypes, it is not considered personally traceable because these genotypes cannot be used to track people systematically, e.g. using software like GEDmatch.