Statistical Classifiers for Data that Do Not Conform to Rows and Columns: A Case Study of T-cell Receptor Datasets
JARED L OSTMEYER, ASSISTANT PROFESSOR, UT SOUTHWESTERN DEPARTMENT OF POPULATION AND DATA SCIENCES
Most statistical classifiers are designed to find patterns in data where numbers fit into rows and columns, as in a spreadsheet. However, many forms of data cannot be structured in this manner. Sequences, like in a text document, cannot be arranged as numbers in a spreadsheet. As is the case with all sequences, both the content and the order of each symbol in the sequence provides important information. To uncover patterns in sequences and other nonconforming data, it is important to consider not only the content of the data but the way that data is structured.
Computer science provides language for describing different ways non-conforming features can be structured. For example, biological data is often structured as a sequence of symbols. The essential property of a sequence is that both the content and order of symbols provide important information. Sets represent another way data can be structured. A set is like a sequence except the order of the symbols does not encode information. Identifying the underlying structure of the features is important because disparate sources of data with the same underlying structure can be handled with the same methods. For example, a social network is a graph of people and the way people are connected just like a molecule is a graph of atoms and the way those atoms are connected, allowing for the same statistical classifier to handle both cases.
To understand different ways data can be structured, we consider two datasets of T-cell receptors, anticipating these datasets to contain signatures for diagnosing disease. The following two T-cell receptor datasets are examples of non-conforming data.
10XGenomics has published a dataset of sequenced T-cell receptors labelled by interaction with disease particle, which are called antigens (link). We refer to this as the antigen classification dataset. Adaptive Biotechnologies has published a separate dataset of patients' sequenced T-cell receptors, which are called immune repertoires, labelled by those patients' CMV status (link). We refer to this as the repertoire classification dataset. Training data is used to fit a model, validation data is used for model selection, and test data is for reporting results. All results on test data must be reported.
- Download: zip
git clone https://github.com/jostmey/dkm