The source code and this README were generated by Gemini 3 Pro Preview.
A flexible, robust Python script for performing dimensionality reduction on tabular data. It supports PCA, t-SNE, and UMAP, handling various input formats including gzipped files, row/column-major orientations, and automatic header detection.
- Algorithms: PCA, t-SNE, and UMAP.
- Input Flexibility: Reads TAB-delimited text (.txt, .tsv) and gzip-compressed files (.gz) seamlessly.
- Orientation Support:
  - Row-major (Standard ML): Rows are samples, columns are features.
  - Column-major (Genomics/Bioinformatics): Columns are samples, rows are features (e.g., gene expression matrices).
- Robust Parsing:
  - Automatically detects headers.
  - Handles "R-format" files (where the header is missing the ID column name).
  - Skips non-data description columns.
- Preprocessing:
  - Automatic Z-score normalization (StandardScaler).
  - Pre-PCA: Optional step to reduce dimensions via PCA before running UMAP/t-SNE to improve speed and denoising on high-dimensional data.
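
The preprocessing steps listed above can be sketched with NumPy alone. This is a minimal illustration, not the script's actual code: the function names `zscore` and `pca_reduce` are hypothetical, the data is synthetic, and the SVD-based projection stands in for what scikit-learn's `StandardScaler` and `PCA` would do.

```python
import numpy as np


def zscore(X):
    """Scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant features stay at zero instead of dividing by 0
    return (X - mean) / std


def pca_reduce(X, n_components):
    """Project samples (rows) onto the top principal components via SVD."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))   # 100 samples, 500 features (synthetic)
X_scaled = zscore(X)
X_pre = pca_reduce(X_scaled, 50)  # analogous to --pre-pca 50
print(X_pre.shape)
```

The reduced matrix `X_pre` would then be handed to UMAP or t-SNE in place of the full feature matrix.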
Requires Python 3 and the following packages:

`pip install numpy scikit-learn umap-learn`

Note: umap-learn is only required if you intend to use the `-a umap` option.
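
One way to keep umap-learn optional is to import it only when UMAP is requested. The sketch below is an assumption about how that could be done, not the script's confirmed behavior; `optional_import` and `get_reducer_module` are hypothetical helper names.

```python
import importlib


def optional_import(module_name, install_hint):
    """Import a module on demand, failing with a readable message if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise SystemExit(
            f"'{module_name}' is required for this option; try: {install_hint}"
        ) from exc


def get_reducer_module(algo):
    """Only touch umap-learn (import name: umap) when UMAP was actually requested."""
    if algo == "umap":
        return optional_import("umap", "pip install umap-learn")
    return None  # PCA and t-SNE come from scikit-learn, so nothing extra to load
```

With this pattern, running PCA or t-SNE never triggers the import, so a missing umap-learn only matters for `-a umap`.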

`python dim_reduce.py [input_file] [options]`

| Argument | Description |
|---|---|
| **Input** | |
| `input_file` | Path to the input TAB-delimited file (can be .gz). |
| `-o`, `--output` | Output filename. Defaults to standard output (stdout). |
| **Parsing** | |
| `-c`, `--col-major` | Column-Major Mode. Switch to this if your samples are columns (e.g., single-cell matrices). Default is Row-Major (rows are samples). |
| `-s`, `--skip` | Number of description/ID columns to skip at the start of each row. Default: 1. |
| **Algorithm** | |
| `-a`, `--algo` | Algorithm to use: pca (default), umap, or tsne. |
| `-n`, `--n-components` | Number of dimensions to output (e.g., 2 or 3). Default: 2. |
| **Processing** | |
| `--pre-pca INT` | Run PCA first to reduce to INT dimensions before running UMAP/t-SNE. Useful for very high-dimensional inputs (e.g., `--pre-pca 50`). |
| `--no-scale` | Disable Z-score normalization (scaling). |
| `--no-eigen` | (PCA only) Output raw scores instead of normalized transformed coordinates. |
| **Hyperparameters** | |
| `--perplexity` | Perplexity for t-SNE. Default: 30.0. |
| `--neighbors` | Number of neighbors for UMAP. Default: 15. |
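
For orientation, the options in the table could be declared with `argparse` roughly as follows. This is a sketch built only from the table, not the script's real parser: the help strings are abbreviated, and treating `input_file` as a required positional is an assumption (the usage line brackets it, so the script may accept other input sources).

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="dim_reduce.py")
    parser.add_argument("input_file", help="TAB-delimited input file (may be .gz)")
    parser.add_argument("-o", "--output", default=None, help="default: stdout")
    parser.add_argument("-c", "--col-major", action="store_true")
    parser.add_argument("-s", "--skip", type=int, default=1)
    parser.add_argument("-a", "--algo", choices=["pca", "umap", "tsne"], default="pca")
    parser.add_argument("-n", "--n-components", type=int, default=2)
    parser.add_argument("--pre-pca", type=int, default=None, metavar="INT")
    parser.add_argument("--no-scale", action="store_true")
    parser.add_argument("--no-eigen", action="store_true")
    parser.add_argument("--perplexity", type=float, default=30.0)
    parser.add_argument("--neighbors", type=int, default=15)
    return parser


args = build_parser().parse_args(["expression.tab.gz", "-c", "-s", "2", "-a", "umap"])
print(args.algo, args.col_major, args.skip, args.neighbors)  # umap True 2 15
```

Flags that are not given on the command line fall back to the defaults from the table, which is why `args.neighbors` is 15 here.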
Row-major (default): each row is a data point (sample). The first N columns, where N is set with -s, are IDs/descriptions and are skipped.
SampleID Feature1 Feature2 Feature3
Sample_A 1.2 0.5 3.3
Sample_B 0.9 0.1 2.1
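
Reading a row-major file like the one above might look like the following sketch. It is illustrative only: `open_text` and `parse_row_major` are hypothetical names, a header line is assumed to be present (the script itself detects this automatically), and `skip` is assumed to be at least 1 so the first column can serve as the sample ID.

```python
import gzip
import io


def open_text(path):
    """Open plain or gzip-compressed text transparently, based on the extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def parse_row_major(handle, skip=1):
    """Return (sample_ids, matrix) from TAB-delimited lines with a header row."""
    next(handle)  # header: feature names are not needed for the embedding
    ids, matrix = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= skip:
            continue  # blank line or no data columns
        ids.append(fields[0])
        matrix.append([float(v) for v in fields[skip:]])
    return ids, matrix


demo = io.StringIO(
    "SampleID\tFeature1\tFeature2\tFeature3\n"
    "Sample_A\t1.2\t0.5\t3.3\n"
    "Sample_B\t0.9\t0.1\t2.1\n"
)
ids, matrix = parse_row_major(demo, skip=1)
print(ids)     # ['Sample_A', 'Sample_B']
print(matrix)  # [[1.2, 0.5, 3.3], [0.9, 0.1, 2.1]]
```

For a real file, `open_text("data.txt.gz")` would supply the handle in place of the in-memory `demo` object.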
Column-major (-c): each column is a data point (sample) and rows are features (genes, metrics). This layout is common in bioinformatics.
GeneID Sample_A Sample_B Sample_C
Gene_1 1.2 0.9 1.5
Gene_2 0.5 0.1 0.4
Note: In Column-Major mode, the script expects sample names in the header. If the file is in "R-format" (header length = data length - 1), the script automatically adjusts alignment.
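
The R-format adjustment and the transposition into one row per sample can be sketched as below. This is an assumption about the approach rather than the script's actual implementation; `parse_col_major` is a hypothetical name, and the input is the example matrix above with the `GeneID` header field removed.

```python
def parse_col_major(lines, skip=1):
    """Return (sample_ids, matrix) with one row per sample."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    if len(header) == len(data[0]) - 1:
        # R-format: the header lacks the ID column name, so pad it on the left.
        header = [""] + header
    sample_ids = header[skip:]
    features = [[float(v) for v in row[skip:]] for row in data]
    # Transpose features-by-samples into samples-by-features.
    matrix = [list(col) for col in zip(*features)]
    return sample_ids, matrix


r_format = [
    "Sample_A\tSample_B\tSample_C\n",
    "Gene_1\t1.2\t0.9\t1.5\n",
    "Gene_2\t0.5\t0.1\t0.4\n",
]
ids, matrix = parse_col_major(r_format)
print(ids)     # ['Sample_A', 'Sample_B', 'Sample_C']
print(matrix)  # [[1.2, 0.5], [0.9, 0.1], [1.5, 0.4]]
```

After this step the data has the same samples-by-features shape as row-major input, so the rest of the pipeline does not need to know which orientation the file used.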
1. Basic PCA on a standard TSV file. Rows are samples. Skip the first column (ID).

   `python dim_reduce.py data.txt -o result_pca.txt`

2. UMAP on gene expression data (column-major). Samples are columns and the file is gzipped. Skip the first 2 columns (e.g., GeneID and GeneSymbol).

   `python dim_reduce.py expression.tab.gz -c -s 2 -a umap -o result_umap.txt`

3. High-performance t-SNE. For a very large dataset (e.g., 20k features), use --pre-pca to reduce to 50 dimensions before running t-SNE.

   `python dim_reduce.py big_data.txt -a tsne --pre-pca 50 --perplexity 50 -o result_tsne.txt`

The output is a simple TAB-delimited file containing the sample ID and the coordinates.
SampleID UMAP_1 UMAP_2
Sample_A 3.412301 -1.203910
Sample_B 1.902311 2.401293
...
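
Output in this shape could be produced by a writer along these lines. It is a sketch under stated assumptions: `write_embedding` is a hypothetical name, the six-decimal formatting is inferred from the sample output above, and the column prefix is passed in by the caller.

```python
import io


def write_embedding(handle, sample_ids, coords, prefix):
    """Write one TAB-delimited line per sample: ID, then each coordinate."""
    n_dims = len(coords[0])
    header = ["SampleID"] + [f"{prefix}_{i + 1}" for i in range(n_dims)]
    handle.write("\t".join(header) + "\n")
    for sample_id, point in zip(sample_ids, coords):
        values = [f"{v:.6f}" for v in point]
        handle.write("\t".join([sample_id] + values) + "\n")


out = io.StringIO()
write_embedding(
    out,
    ["Sample_A", "Sample_B"],
    [[3.412301, -1.20391], [1.902311, 2.401293]],
    "UMAP",
)
print(out.getvalue(), end="")
```

Passing `sys.stdout` as the handle would match the documented default of writing to standard output when `-o` is not given.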