PHet: Heterogeneity-Preserving Discriminative Feature Selection for Subtype Discovery

Heterogeneity-Preserving Discriminative Feature Selection for Disease-Specific Subtype Discovery

Workflow

Basic Description

This repository contains a range of subtype detection algorithms, with a primary focus on the PHet (Preserving Heterogeneity) algorithm. PHet performs recurrent subsampling differential analysis and interquartile range (IQR) calculations between conditions to identify a minimal set of features that preserves heterogeneity while maximizing the quality of subtype clustering. On public microarray and single-cell RNA-seq datasets, PHet identified disease subtypes effectively and outperformed previous outlier-based methods. This guide is a tutorial on running 25 different algorithms; it does not describe every feature of the package. For more information about arguments and functionality, run python main.py --help.
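To make the idea of comparing IQRs between conditions over repeated subsamples more concrete, here is a minimal sketch using only the Python standard library. It is not PHet's implementation: the function names, the default subsample size, and the scoring rule (mean absolute IQR difference per feature) are assumptions chosen for illustration.

```python
import random
import statistics


def iqr(values, low=25, high=75):
    """Range between the `low` and `high` percentiles of `values`."""
    # With n=100, quantiles() returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[high - 1] - cuts[low - 1]


def iqr_difference_scores(control, case, num_subsamples=100, subsample_size=None, seed=0):
    """Mean absolute IQR difference per feature over random subsamples.

    `control` and `case` are lists of samples; each sample is a list of
    feature values, and all samples have the same length.
    Illustrative scoring rule only (assumption), not the PHet statistic.
    """
    rng = random.Random(seed)
    num_features = len(control[0])
    # Assumed default: half of the smaller group, but never fewer than 2 samples.
    size = subsample_size or max(2, min(len(control), len(case)) // 2)
    totals = [0.0] * num_features
    for _ in range(num_subsamples):
        sub_control = rng.sample(control, size)
        sub_case = rng.sample(case, size)
        for j in range(num_features):
            control_iqr = iqr([sample[j] for sample in sub_control])
            case_iqr = iqr([sample[j] for sample in sub_case])
            totals[j] += abs(case_iqr - control_iqr)
    return [total / num_subsamples for total in totals]
```

A feature whose spread differs between control and case samples receives a higher score than a feature that is equally spread in both groups, which is the intuition behind using IQR to retain heterogeneous features.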

Dependencies

We highly recommend installing Anaconda, an open-source distribution of the Python and R programming languages for data wrangling, predictive analytics, and scientific computing. The codebase is tested under Python 3.11. To install the required packages, run the following command:

pip install -r requirements.txt

The packages that PHet requires are listed in requirements.txt.

Test Samples

Two test datasets with their associated files are provided with this package:

  • Microarray SRBCT data:

    • "srbct_matrix.mtx": The small round blue cell tumors (SRBCT) expression dataset (83, 2308).
    • "srbct_feature_names.csv": The names of the features of the SRBCT data (2308 features).
    • "srbct_classes.csv": Binary classes (0 or 1) of the samples of the SRBCT data (83 samples).
    • "srbct_types.csv": The subtypes of the SRBCT samples (83 samples). There are four subtypes of small round blue cell tumors: Ewing's sarcoma (EWS), neuroblastoma (NB), rhabdomyosarcoma (RMS), and Burkitt's lymphoma (BL).
    • "srbct_deco_features.csv": Ranked features from DECO on the SRBCT data. Features are ranked based on the DECO statistics (145, 2).
    • "srbct_limma_features.csv": LIMMA results for the SRBCT data. Features are ranked based on the B value, which measures the log-odds that a feature is differentially expressed (2308, 7).
  • Single-cell transcriptomics HBECs data:

    • "hbecs_matrix.mtx": A reduced version of the human bronchial epithelial cells (HBECs) expression dataset (297, 25475).
    • "hbecs_feature_names.csv": The names of the features of the HBECs data (25475 features).
    • "hbecs_classes.csv": Binary classes (0 or 1) of the samples of the HBECs data (297 samples).
    • "hbecs_markers.csv": A predefined list of signatures (411 features).
    • "hbecs_types.csv": The subtypes of the HBECs samples (297 samples). There are two cell types: Basal and Ionocytes.
    • "hbecs_donors.csv": Donor labels for the HBECs samples (three donors, 297 samples).

As a best practice, store all of these files in a single directory.
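Before running an experiment, it can help to confirm that the dataset folder is complete. The sketch below uses only the standard library. The helper names are hypothetical, and treating these four suffixes as required is an assumption based on the files that both test datasets share; main.py may expect a different set. The shape reader relies on the Matrix Market convention that the first non-comment line holds the row and column counts, and whether rows correspond to samples or features in these files is not stated here.

```python
from pathlib import Path

# Suffixes shared by both test datasets (assumed to be the required set).
COMMON_SUFFIXES = ("_matrix.mtx", "_feature_names.csv", "_classes.csv", "_types.csv")


def missing_files(dspath, file_name, suffixes=COMMON_SUFFIXES):
    """Return the expected file names that are absent from the dataset folder."""
    folder = Path(dspath)
    return [
        file_name + suffix
        for suffix in suffixes
        if not (folder / (file_name + suffix)).is_file()
    ]


def mtx_shape(path):
    """Read (rows, columns) from the size line of a Matrix Market file."""
    with open(path) as handle:
        for line in handle:
            # Header and comment lines start with "%"; skip them and blank lines.
            if line.startswith("%") or not line.strip():
                continue
            rows, cols = line.split()[:2]
            return int(rows), int(cols)
    raise ValueError(f"no size line found in {path}")
```

For example, missing_files("data", "srbct") returns an empty list when all four files are present in a folder named data (the folder name is a placeholder).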

Installation and Basic Usage

Run the following commands to clone the repository to an appropriate location:

git clone https://github.com/kleelab-bch/phet

For all experiments, navigate to the src folder, then run the commands of your choice. For example, to display the available options, run: python main.py --help. The help output should be self-explanatory. All command arguments are handled through the main.py file. The examples below show how to run experiments using the provided test data.

The arguments used in the following examples are:

  • --dspath: the location of the dataset folder.
  • --rspath: the location of the result folder.
  • --build-syn-dataset: a true/false value indicating whether to generate simulated data.
  • --file-name: the name of the input data.
  • --suptitle-name: the suptitle of the figures.
  • --control-name: the name of the control group.
  • --case-name: the name of the case group.
  • --methods: a list of subtype detection methods.
  • --direction: the direction of the hypothesis test.
  • --iqr-range: the range over which percentiles are computed.
  • --normalize: the type of normalization to apply.
  • --q: the percentile to compute.
  • --dids-scoref: the final function used to compute feature scores for DIDS scoring.
  • --num-subsamples: the number of subsamples.
  • --feature-weight: the weights for binning intervals for PHet.
  • --alpha: the significance level cutoff.
  • --score-metric: the metric used for evaluation.
  • --top-k-features: the number of top features to consider for evaluation and plotting.
  • --plot-top-k-features: plots a UMAP of the data using the top k features.
  • --cluster-type: the type of clustering algorithm.
  • --export-spring: exports the data needed for the SPRING plot.
  • --num-jobs: the number of parallel workers.
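As an illustration of what a "zscore" normalization typically does, the sketch below standardizes each feature (column) of a samples-by-features matrix to zero mean and unit standard deviation. This is a generic z-score, not code from PHet: the function name is hypothetical, and the use of the population standard deviation and the handling of constant features are assumptions.

```python
import statistics


def zscore_columns(matrix):
    """Z-score each feature (column) of a samples-by-features matrix.

    Constant features have zero standard deviation; they are mapped to 0.0
    here to avoid division by zero (an assumption for this sketch).
    """
    columns = list(zip(*matrix))
    means = [statistics.fmean(column) for column in columns]
    stds = [statistics.pstdev(column) for column in columns]
    return [
        [(value - mean) / std if std > 0 else 0.0
         for value, mean, std in zip(row, means, stds)]
        for row in matrix
    ]
```

After this transformation, features measured on different scales contribute comparably to distance-based steps such as clustering.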

Example 1

Several algorithms can be applied in a single run. Here is a simple illustration:

python main.py --dspath [path to the folder containing data] --rspath [path to the folder containing results] --file-name "srbct" --suptitle-name "SRBCT" --control-name "Control" --case-name "Case" --methods ttest_g wilcoxon_g ks_g copa os ort most lsoss dids phet_br --direction "both" --iqr-range 25 75 --normalize "zscore" --q 75 --dids-scoref "tanh" --num-subsamples 1000 --feature-weight 0.4 0.3 0.2 0.1 --alpha 0.01 --score-metric "f1" --top-k-features 100 --cluster-type "kmeans" --num-jobs 2

For the --file-name argument, please include only the name of the data and remove the suffix _matrix.mtx. This will generate several files located in the rspath folder.
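When the same command is run repeatedly with different methods or datasets, it may be convenient to assemble it in Python. The following sketch builds the argument list using only flags that appear in this guide. The helper names build_phet_command and run_phet are hypothetical, and the folder names in the usage note are placeholders.

```python
import subprocess
import sys


def build_phet_command(dspath, rspath, file_name, methods, extra=None):
    """Assemble the argument list for a main.py run.

    `extra` maps a flag to a list of values, for example
    {"--iqr-range": [25, 75]}; use an empty list for on/off flags
    such as "--export-spring".
    """
    command = [
        sys.executable, "main.py",
        "--dspath", dspath,
        "--rspath", rspath,
        "--file-name", file_name,
        "--methods", *methods,
    ]
    for flag, values in (extra or {}).items():
        command.append(flag)
        command.extend(str(value) for value in values)
    return command


def run_phet(command, src_dir):
    """Run the command from the src folder and raise if it exits with an error."""
    return subprocess.run(command, cwd=src_dir, check=True)
```

For instance, build_phet_command("data", "results", "srbct", ["phet_br", "dids"], {"--alpha": [0.01]}) returns a list that can be passed to run_phet together with the path to the src folder.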

Example 2

To infer subtypes using LIMMA and DECO, first run LIMMA and DECO, then store the resulting features in .csv format with the appropriate suffixes (_limma_features.csv and _deco_features.csv). The following example shows how to obtain subtypes using features from these algorithms (srbct_limma_features.csv and srbct_deco_features.csv):

python main.py --dspath [path to the folder containing data] --rspath [path to the folder containing results] --file-name "srbct" --suptitle-name "SRBCT" --control-name "Control" --case-name "Case" --methods limma_g deco --alpha 0.01 --score-metric "f1" --top-k-features 100 --cluster-type "kmeans" --num-jobs 2

For the --file-name argument, please include only the name of the data and remove the suffix _matrix.mtx. This will generate several files located in the rspath folder.

Example 3

To export files for the SPRING plot, enable the --export-spring argument. Here, we run the PHet (with IQR) model on the HBECs data:

python main.py --dspath [path to the folder containing data] --rspath [path to the folder containing results] --file-name "hbecs" --suptitle-name "Basal vs Ionocytes" --control-name "Basal" --case-name "Ionocytes" --methods phet_br --export-spring --iqr-range 25 75 --normalize "zscore" --num-subsamples 1000 --feature-weight 0.4 0.3 0.2 0.1 --alpha 0.01 --score-metric "f1" --top-k-features 100 --cluster-type "kmeans" --num-jobs 2

For the --file-name argument, please include only the name of the data and remove the suffix _matrix.mtx. This will generate several files located in the rspath folder.

Citing

If you find PHet useful in your research, please consider citing the following paper:

Contact

For any inquiries, please contact: ar.basher@childrens.harvard.edu
