Illumina BeadArray Platform Detection
We're interested in automatically detecting the platform/annotation of publicly available Illumina expression BeadChip data. Some background discussion here. This repository contains some preliminary analyses of human platforms.
Data and methodology overview
We're looking at Illumina Human "whole genome" chips. (These chips have additional transcripts beyond those on the "Ref" platforms.)
These are the GEO accessions we're using (from the linked issue above):
|Platform Name||GEO Accession|
The series lists for each of these platforms (in data/series_lists) were obtained through the GEO Browser on May 17, 2018. The platform accession was used as a search term, then: click on series -> Export -> All Search results & Tab.
For each platform, we randomly selected 30 series that had supplementary txt files (i.e., there was a chance that non-normalized.txt files were available).
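The selection step can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the column names `Accession` and `Supplementary Types`, the seed, and the helper name `sample_series` are all assumptions, and the accessions in the usage note are made up.

```python
import random


def sample_series(rows, n=30, seed=0,
                  accession_field="Accession",
                  supp_field="Supplementary Types"):
    """Randomly pick up to n series whose supplementary types include TXT.

    rows: iterable of dicts, e.g. from csv.DictReader over a tab-delimited
    series list. The field names are assumptions about the export format.
    """
    candidates = [
        row[accession_field]
        for row in rows
        if "TXT" in row.get(supp_field, "").upper()
    ]
    # A seeded generator keeps the sample reproducible across runs.
    rng = random.Random(seed)
    return sorted(rng.sample(candidates, min(n, len(candidates))))
```

With a series list loaded via `csv.DictReader(handle, delimiter="\t")`, calling `sample_series(reader)` would return at most 30 accessions. Series without a TXT supplementary file are never candidates.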
We downloaded supplementary files that matched this pattern:
In this sample, newer platforms had more accessions with raw data (as defined by this pattern; see table below).
List of probes and calculating overlap
For each series, we calculated what proportion of identifiers were in each of the lists of probes and the proportion of probes that were in the identifiers. Identifiers were (naively) assumed to be in the first column, as this is generally consistent with GEO instructions.
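The two proportions described above can be sketched as follows. This is an illustrative version under stated assumptions, not the repository's implementation: identifiers are read from the first column of a tab-delimited file with one header row, duplicates are collapsed into sets, and the function names and toy probe IDs are hypothetical.

```python
import csv


def first_column_identifiers(lines, delimiter="\t", has_header=True):
    """Collect the non-empty values of the first column as a set."""
    reader = csv.reader(lines, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    return {row[0].strip() for row in reader if row and row[0].strip()}


def overlap_proportions(identifiers, platform_probes):
    """Return (share of identifiers found among the platform's probes,
    share of the platform's probes found among the identifiers)."""
    ids = set(identifiers)
    probes = set(platform_probes)
    shared = ids & probes
    prop_ids = len(shared) / len(ids) if ids else 0.0
    prop_probes = len(shared) / len(probes) if probes else 0.0
    return prop_ids, prop_probes


if __name__ == "__main__":
    # Toy data only: 3 of 4 identifiers match, covering 3 of 6 probes.
    ids = {"P1", "P2", "P3", "not_a_probe"}
    probes = {"P1", "P2", "P3", "P4", "P5", "P6"}
    print(overlap_proportions(ids, probes))  # (0.75, 0.5)
```

Reporting both directions matters because a series on a smaller chip can have nearly all of its identifiers in a larger platform's probe list while covering only a fraction of that platform's probes.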
The newer platforms (v2 and beyond) have some overlap in their identifiers. However, for most series, the identifiers from the data had the highest overlap with the platform under which the series was labeled. There were some exceptions.
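One way to flag such exceptions is to compare the labeled platform against the platform with the highest overlap. The sketch below is hypothetical: the function name, the mapping from labels to package names, and the overlap values are all made up for illustration, and ties are resolved by whichever platform comes first in the dictionary.

```python
def best_matching_platform(overlap_by_platform, labeled_platform):
    """Return (best_platform, is_exception).

    overlap_by_platform: dict mapping a platform name to the proportion of
    series identifiers found in that platform's probe list.
    is_exception is True when the best match differs from the label.
    """
    if not overlap_by_platform:
        raise ValueError("no overlap values supplied")
    # max() returns the first key with the largest value, so ties go to
    # the platform listed first.
    best = max(overlap_by_platform, key=overlap_by_platform.get)
    return best, best != labeled_platform


if __name__ == "__main__":
    # Made-up proportions for a series labeled as a v3 platform.
    overlaps = {"Humanv1": 0.02, "Humanv2": 0.61, "Humanv3": 0.99, "Humanv4": 0.80}
    print(best_matching_platform(overlaps, "Humanv3"))  # ('Humanv3', False)
```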
We highlight the findings from the series labeled GPL6947 (HumanHT-12 v3.0) below.
Fig 1. Overlap between identifiers from series labeled GPL6947 and the four platform Bioconductor packages.
For most series (shown in black), we see the highest overlap with Humanv3 and some amount of overlap with
This is consistent with what we would expect based on the relationships between the platforms themselves.
However, some series contained multiple Illumina platforms (and were not SuperSeries), which we would not have detected given our methodology.
Once we color points from different platforms, we see that they behave as expected.
GPL10558 is consistent with the pattern of most GPL10558 (HumanHT-12 v4) series.
The identifiers from GPL6104 (HumanRef-8 v2.0) have the highest overlap with Humanv2 and less than 50% of the probes from v4 are in the series identifiers, consistent with the smaller subset of transcripts present on Ref chips.
Below, we describe series with overlaps that deviated from this pattern for other platforms.
GPL2507 (Human-6 v1.0)
GSE17241 contains multiple platforms. The GPL6106 raw data appears to use bead IDs rather than probe IDs.
GPL6102 (Human-6 v2.0)
GSE14295 uses gene symbols rather than probe IDs. We'd be unable to get the probe sequences, which are required for processing, for this experiment.
GSE35102 looks like it could be WG-6 v2 filtered to only probes that are also present on the v3 chip. This would not matter for processing, as we'd be able to obtain gene identifiers and probe sequences for all the probes.
PROBE_ID column that is not the first column, so we missed it using this method.
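A possible refinement, which is not what the analysis above did, would be to look for a known identifier header before falling back to the first column. In this sketch the helper name and the candidate list are assumptions (`PROBE_ID` comes from the case above; `ID_REF` is added as a commonly used GEO header).

```python
def find_identifier_column(header, candidates=("PROBE_ID", "ID_REF")):
    """Return the index of the identifier column in a header row.

    Candidate names are matched case-insensitively in priority order.
    If none is present, fall back to column 0 (the naive assumption).
    """
    normalized = [name.strip().upper() for name in header]
    for candidate in candidates:
        if candidate.upper() in normalized:
            return normalized.index(candidate.upper())
    return 0


if __name__ == "__main__":
    # Hypothetical header where the probe IDs sit in the second column.
    print(find_identifier_column(["SYMBOL", "PROBE_ID", "AVG_Signal"]))  # 1
```

The fallback keeps the original first-column behavior for files without a recognizable header, so series that were handled correctly before would be read the same way.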