Generic Dimension Reduction Tool

The source code and this README are generated by Gemini 3 Pro Preview.

A flexible, robust Python script for performing dimensionality reduction on tabular data. It supports PCA, t-SNE, and UMAP, handling various input formats including gzipped files, row/column-major orientations, and automatic header detection.

Features

Algorithms: PCA, t-SNE, and UMAP.
Input Flexibility: Reads TAB-delimited text (.txt, .tsv) and Gzip compressed files (.gz) seamlessly.
Orientation Support:
- Row-major (Standard ML): Rows are samples, columns are features.
- Column-major (Genomics/Bioinformatics): Columns are samples, rows are features (e.g., gene expression matrices).
Robust Parsing:
- Automatically detects headers.
- Handles "R-format" files (where the header is missing the ID column name).
- Skips non-data description columns.
Preprocessing:
- Automatic Z-score normalization (StandardScaler).
- Pre-PCA: Optional step that reduces dimensions via PCA before running UMAP/t-SNE, which speeds up the run and reduces noise on high-dimensional data.
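
The preprocessing steps listed above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the script's actual code: the function name `reduce_with_pre_pca`, its parameters, and the synthetic demo matrix are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def reduce_with_pre_pca(X, pre_pca=50, n_components=2, perplexity=30.0, scale=True):
    """Hypothetical helper: scale, optionally pre-reduce with PCA, then run t-SNE."""
    X = np.asarray(X, dtype=float)
    if scale:
        # Z-score each feature (zero mean, unit variance).
        X = StandardScaler().fit_transform(X)
    if pre_pca is not None and pre_pca < min(X.shape):
        # PCA cannot return more components than min(n_samples, n_features).
        X = PCA(n_components=pre_pca).fit_transform(X)
    # scikit-learn's t-SNE requires perplexity < n_samples.
    tsne = TSNE(n_components=n_components, perplexity=perplexity, random_state=0)
    return tsne.fit_transform(X)


rng = np.random.default_rng(0)
demo = rng.normal(size=(60, 200))  # synthetic data: 60 samples, 200 features
embedding = reduce_with_pre_pca(demo, pre_pca=20, perplexity=10.0)
print(embedding.shape)
```

The pre-PCA step is skipped when the requested size is not smaller than the data, which is one reasonable choice for a sketch; the real script may handle that case differently.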

Dependencies

Requires Python 3 and the following packages:

pip install numpy scikit-learn umap-learn

Note: umap-learn is only required if you intend to use the -a umap option.
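
One way to keep umap-learn optional is to check for it only when UMAP is requested. The sketch below shows that pattern; `umap_available` and `check_algo` are hypothetical names, and the script itself may handle a missing package differently.

```python
import importlib.util


def umap_available():
    """Return True if the `umap` module (installed by umap-learn) can be found."""
    return importlib.util.find_spec("umap") is not None


def check_algo(algo):
    """Hypothetical guard: fail early with a clear message if UMAP is missing."""
    if algo == "umap" and not umap_available():
        raise SystemExit("umap-learn is not installed; run: pip install umap-learn")
    return algo


print(check_algo("pca"))
```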

Usage

python dim_reduce.py [input_file] [options]

Arguments

Argument	Description
Input
`input_file`	Path to the input TAB-delimited file (can be `.gz`).
`-o`, `--output`	Output filename. Defaults to Standard Output (stdout).
Parsing
`-c`, `--col-major`	Column-Major Mode. Switch to this if your samples are columns (e.g., single-cell matrices). Default is Row-Major (rows are samples).
`-s`, `--skip`	Number of description/ID columns to skip at the start of each row. Default: `1`.
Algorithm
`-a`, `--algo`	Algorithm to use: `pca` (default), `umap`, or `tsne`.
`-n`, `--n-components`	Number of dimensions to output (e.g., 2 or 3). Default: `2`.
Processing
`--pre-pca INT`	Run PCA first to reduce to `INT` dimensions before running UMAP/t-SNE. Useful for very high-dim inputs (e.g., `--pre-pca 50`).
`--no-scale`	Disable Z-score normalization (scaling).
`--no-eigen`	(PCA only) Output raw scores instead of normalized transformed coordinates.
Hyperparameters
`--perplexity`	Perplexity for t-SNE. Default: `30.0`.
`--neighbors`	Number of neighbors for UMAP. Default: `15`.
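
For readers who want to see how the options in the table fit together, here is a hedged `argparse` sketch that mirrors the documented flags and defaults. It is a reconstruction from this table, not the parser in the script; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    p = argparse.ArgumentParser(prog="dim_reduce.py")
    p.add_argument("input_file")
    p.add_argument("-o", "--output", default=None)  # None stands for stdout here
    p.add_argument("-c", "--col-major", action="store_true")
    p.add_argument("-s", "--skip", type=int, default=1)
    p.add_argument("-a", "--algo", choices=["pca", "umap", "tsne"], default="pca")
    p.add_argument("-n", "--n-components", type=int, default=2)
    p.add_argument("--pre-pca", type=int, default=None, metavar="INT")
    p.add_argument("--no-scale", action="store_true")
    p.add_argument("--no-eigen", action="store_true")
    p.add_argument("--perplexity", type=float, default=30.0)
    p.add_argument("--neighbors", type=int, default=15)
    return p


args = build_parser().parse_args(["expression.tab.gz", "-c", "-s", "2", "-a", "umap"])
print(args.col_major, args.skip, args.algo, args.n_components)
```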

Input Formats

1. Row-Major (Default)

Each row is a data point (sample). The leading ID/description columns are skipped; their number is set with -s (default: 1).

SampleID    Feature1    Feature2    Feature3
Sample_A    1.2         0.5         3.3
Sample_B    0.9         0.1         2.1
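
A minimal sketch of reading this layout with the standard library follows. It assumes exactly one header line and a skip count of at least 1; the script's own header detection is more involved. The helper names `open_text` and `read_row_major` are hypothetical.

```python
import gzip
import io


def open_text(path):
    """Open plain or gzip-compressed text, chosen by file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_row_major(handle, skip=1):
    """Return (sample_ids, matrix) from a TAB-delimited stream with one header line."""
    next(handle)  # assumption: the first line is a header
    ids, rows = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= skip:
            continue  # blank or malformed line
        ids.append(fields[0])  # first column is used as the sample ID
        rows.append([float(v) for v in fields[skip:]])
    return ids, rows


text = (
    "SampleID\tFeature1\tFeature2\tFeature3\n"
    "Sample_A\t1.2\t0.5\t3.3\n"
    "Sample_B\t0.9\t0.1\t2.1\n"
)
ids, matrix = read_row_major(io.StringIO(text))
print(ids, matrix)
```

For a file on disk, `read_row_major(open_text("data.txt.gz"))` would work the same way, since both branches of `open_text` return a text-mode handle.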

2. Column-Major (`-c`)

Each column is a data point (sample). Rows are features (genes, metrics). This is common in bioinformatics.

GeneID      Sample_A    Sample_B    Sample_C
Gene_1      1.2         0.9         1.5
Gene_2      0.5         0.1         0.4

Note: In Column-Major mode, the script expects sample names in the header. If the file is in "R-format" (header length = data length - 1), the script automatically adjusts alignment.
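
The R-format adjustment and the transpose to one row per sample can be sketched as below. This is an illustration of the idea described in the note, not the script's implementation; `read_col_major` is a hypothetical helper, and the sketch assumes the first line is always a header.

```python
def read_col_major(lines, skip=1):
    """Return (sample_ids, matrix) with one row per sample from column-major lines."""
    header = lines[0].rstrip("\n").split("\t")
    body = [ln.rstrip("\n").split("\t") for ln in lines[1:] if ln.strip()]
    if len(header) == len(body[0]) - 1:
        # R-format: the header has no name for the ID column, so pad it on the left.
        header = [""] + header
    sample_ids = header[skip:]
    features = [[float(v) for v in row[skip:]] for row in body]
    # Transpose so that rows become samples and columns become features.
    matrix = [list(col) for col in zip(*features)]
    return sample_ids, matrix


r_format = [
    "Sample_A\tSample_B\tSample_C\n",  # header is one field shorter than the data
    "Gene_1\t1.2\t0.9\t1.5\n",
    "Gene_2\t0.5\t0.1\t0.4\n",
]
sample_ids, matrix = read_col_major(r_format)
print(sample_ids, matrix)
```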

Examples

1. Basic PCA on a standard TAB-delimited file

Rows are samples. The first column (ID) is skipped by default (-s 1).

python dim_reduce.py data.txt -o result_pca.txt

2. UMAP on Gene Expression Data (Column-Major)

Samples are columns and the file is gzipped. The first 2 columns (e.g., GeneID and GeneSymbol) are skipped with -s 2.

python dim_reduce.py expression.tab.gz -c -s 2 -a umap -o result_umap.txt

3. High-Performance t-SNE

For a very large dataset (e.g., 20k features), use --pre-pca to reduce to 50 dimensions before running t-SNE.

python dim_reduce.py big_data.txt -a tsne --pre-pca 50 --perplexity 50 -o result_tsne.txt

Output Format

The output is a simple TAB-delimited file containing the Sample ID and the coordinates.

SampleID    UMAP_1      UMAP_2
Sample_A    3.412301    -1.203910
Sample_B    1.902311    2.401293
...
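
A short sketch of producing this layout is shown here. The six-decimal formatting is inferred from the sample above, and the column naming for algorithms other than UMAP is an assumption; `write_embedding` is a hypothetical helper, not a function from the script.

```python
import io
import sys


def write_embedding(ids, coords, algo="umap", out=sys.stdout):
    """Write a header line, then one TAB-delimited line per sample."""
    n_dims = len(coords[0])
    # Assumption: columns are named <ALGO>_<index>, as in the UMAP sample above.
    header = ["SampleID"] + [f"{algo.upper()}_{i + 1}" for i in range(n_dims)]
    out.write("\t".join(header) + "\n")
    for sample_id, row in zip(ids, coords):
        out.write("\t".join([sample_id] + [f"{v:.6f}" for v in row]) + "\n")


buf = io.StringIO()
write_embedding(
    ["Sample_A", "Sample_B"],
    [[3.412301, -1.20391], [1.902311, 2.401293]],
    out=buf,
)
print(buf.getvalue(), end="")
```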
