This project contains a script and the resulting data for collecting and analyzing metadata from FlowRepository. The primary purpose for this is to collect FCS metadata that isn't otherwise readily available. For example, the data and code here is good for answering questions like:
- What datasets have at least X of N possible intracellular or surface markers?
- What FCS files available across the whole of FlowRepository have the largest panels?
- What published datasets are most similar to another dataset of interest?
The logic for data collection exists in build.R, which downloads the first 32k of each FCS file, parses the header, and stores channel names along with other dataset and file level metadata. The biggest challenge in using this information is normalizing all the different parameter naming conventions into a common nomenclature. An effort is made in this project to do this by using an HGNC lookup table (from HGNChelper) as well as parsing and matching logic in resolve.R. At TOW, this resolves cases like this:
|Channel or Parameter Name||Result|
|CD235A BIOTIN/QD 585||CD235A|
You get the picture. The matching isn't perfect but using typical name delimiters like
[./ -#] and joining the resulting terms to a master lookup table resolves most cases, and gives preference to CD designations for the sake of cross-dataset comparison. There is also a manual list of mappings in data/param_map_edits.csv that can be altered to remap with new manual additions (or with changes made to resolve.R). This list was built with a focus on T cell markers so more entries might be necessary to reach a good normalization for other kinds of studies.