# ISCB-Africa ASBCB 2025

## Lagoon Beach Hotel & Conference Center, Cape Town, South Africa

**Date of the session:** April 10, 2025

**Instructor(s)/Affiliation(s):** Loni Taylor, Meharry Medical College, Nashville, TN, USA. Bishnu Sarker, Meharry Medical College, Nashville, TN, USA. Animesh Acharjee, University of Birmingham, UK.

## **Non-negative Matrix Factorization (NMF) for Multiomics Data Integration**


### Overview:
This section will dive deeper into Non-negative Matrix Factorization (NMF), explaining how it can be used to decompose complex multiomics datasets into meaningful components. NMF’s role in identifying latent biological factors and reducing the complexity of data will be discussed in the context of multiomics.

#### Topics:

+ Principles of NMF and its applications in biological data
+ How NMF helps in dimensionality reduction and feature extraction
+ Sample demonstration using NMF in omics data analysis

### Non-negative Matrix Factorization (NMF):

NMF is a matrix factorization technique that helps uncover hidden patterns in complex datasets by decomposing data into non-negative components. NMF decomposes a non-negative matrix into two non-negative matrices—one representing the relationship between samples and the latent components, and the other representing the relationship between the latent components and the original features.

It is particularly useful in multiomics, where it can identify latent factors that represent biological processes shared across different layers of omics data. Reducing the dimensionality of multiomics data in this way makes complex biological relationships easier to analyze and visualize, and supports further analysis such as clustering, classification, and feature extraction.



### Key Characteristics of NMF:

**1. Unsupervised Learning:** NMF is unsupervised because it doesn't require labeled data. It seeks to find hidden patterns or structures in the data based on the data's intrinsic properties.

**2. Factorization:** NMF works by factorizing a non-negative matrix 𝑉 into two smaller non-negative matrices 𝑊 and 𝐻, such that:

                        𝑉 ≈ 𝑊 × 𝐻

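+ 𝑉 is the original data matrix (for example, gene expression values or a document-term matrix), with one row per sample and one column per feature.

+ 𝑊 contains the "coefficients" or "activations," showing how strongly each component is present in each sample (one row per sample, one column per component).

+ 𝐻 contains the "basis" or "components," often interpreted as feature patterns (one row per component, one column per feature).

This is the orientation used by scikit-learn and by the code in this notebook. Some references factorize the transposed matrix, in which case the roles of 𝑊 and 𝐻 are swapped.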

**3. Non-Negativity Constraint:** The key feature of NMF is that it enforces non-negative values in the factorization. This is useful in biological data or image processing, where negative values may not make sense or may not have an interpretable meaning.

**4. Dimensionality Reduction:** NMF is often used for reducing the dimensionality of the data while maintaining interpretability. The lower-dimensional representations (in 𝑊 and 𝐻) can reveal latent patterns or clusters in the data.
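
The characteristics above can be illustrated on a tiny matrix. This is a minimal sketch, not part of the workshop data: the values in `V` and the choice of `k = 2` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data: 6 samples (rows) x 4 features (columns); values are illustrative
V = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [5.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 5.0, 5.0],
])

k = 2  # assumed number of latent components
model = NMF(n_components=k, init='nndsvda', random_state=0, max_iter=1000)
W = model.fit_transform(V)   # shape (6, 2): one row per sample
H = model.components_        # shape (2, 4): one column per feature

V_hat = W @ H                # low-rank approximation of V
rel_error = np.linalg.norm(V - V_hat) / np.linalg.norm(V)

print("W shape:", W.shape, "H shape:", H.shape)
print("All entries non-negative:", bool((W >= 0).all() and (H >= 0).all()))
print("Relative reconstruction error:", round(float(rel_error), 3))
```

The two blocks of samples in `V` should load mainly on different components, which is the kind of additive, parts-based structure the non-negativity constraint encourages.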


### Applications:

+ Gene Expression Data: NMF is commonly used to uncover hidden patterns in gene expression data, identifying gene signatures or latent biological processes.

+ Text Mining: In natural language processing, NMF can be applied to decompose document-term matrices to uncover topics (similar to Latent Dirichlet Allocation, LDA).

+ Image Processing: NMF is also used for image compression and feature extraction, where the matrix could represent pixel intensities.

+ Multiomics Integration: NMF can be applied to integrate and uncover common features in multiple omics layers (e.g., genomics, transcriptomics, and proteomics).

The remainder of this notebook demonstrates how to implement NMF for integrating multiomics data, such as RNA expression, methylation, and protein expression data.
Download and follow this code notebook to see a simple process for integrating and analyzing multiomics data in Python.


## Implementing Non-negative Matrix Factorization (NMF)

We begin with the basic library imports: pandas, NumPy, and scikit-learn, which cover our data analysis, scientific computing, and machine learning needs.

#### Pandas

+ Purpose: Data manipulation and analysis.

+ Key Features:

    + Provides DataFrame and Series objects for handling data.

    + Easy handling of missing data.

    + Supports merging, grouping, and reshaping data.

    + Ideal for data cleaning, transformation, and exploration.

+ Common Use: Loading data from various sources (CSV, Excel, SQL), data wrangling (cleaning, reshaping, etc.), and exploratory data analysis (EDA).

#### NumPy
+ Purpose: Numerical computations and array manipulation.

+ Key Features:

    + Provides support for multi-dimensional arrays and matrices.

    + Fast array operations (vectorized operations).

    + Supports linear algebra, random sampling, and Fourier transforms.

+ Common Use: High-performance operations on large arrays or matrices, scientific computing, and numerical analysis.

#### Scikit-learn
+ Purpose: Machine learning.

+ Key Features:

    + A library for supervised and unsupervised learning.

    + Provides tools for model training, evaluation, and cross-validation.

    + Supports a wide range of algorithms, including classification, regression, clustering, and dimensionality reduction.

    + Includes utilities for preprocessing, such as scaling, encoding, and splitting data.

+ Common Use: Building and evaluating machine learning models, performing tasks like classification, regression, clustering, and feature selection.

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler


In [2]:
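# Sample multiomics data (replace 'filepath' with the location of your own files)
# Each row represents a gene, and each column represents a sample
# Data should be non-negative

data = pd.read_csv('filepath/CLL_data_mRNA.csv')
data2 = pd.read_csv('filepath/CLL_data_Methylation.csv')

# Caution: pd.DataFrame(existing_df, index=labels) reindexes (looks rows up by label)
# instead of renaming them. read_csv assigns integer row labels here, so none of the
# 'gene_i' labels match: data_rna and data_methylation become all-NaN and are emptied
# by the dropna step that follows. To rename rows instead, df.set_axis(labels, axis=0)
# can be used when len(labels) == len(df).
#data_rna = pd.DataFrame(np.random.rand(136, 50), index=['gene_{}'.format(i) for i in range(136)])
data_rna = pd.DataFrame(data, index=['gene_{}'.format(i) for i in range(136)])
#data_methylation = pd.DataFrame(np.random.rand(136, 50), index=['gene_{}'.format(i) for i in range(136)])
data_methylation = pd.DataFrame(data2, index=['gene_{}'.format(i) for i in range(136)])

# Protein layer: simulated with uniform random values in [0, 1)
data_protein = pd.DataFrame(np.random.rand(136, 50), index=['gene_{}'.format(i) for i in range(136)])
#data_protein = pd.DataFrame(data, index=['gene_{}'.format(i) for i in range(len(data.columns))])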
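
When row labels are attached to a table that was read from a file, it matters whether pandas reindexes or relabels. The sketch below shows the difference on a toy table; `toy` and `labels` are hypothetical names, and the table merely stands in for the result of `pd.read_csv`.

```python
import pandas as pd

# Toy table standing in for a CSV read with pd.read_csv (default integer row labels)
toy = pd.DataFrame({'s1': [1.0, 2.0, 3.0], 's2': [4.0, 5.0, 6.0]})
labels = ['gene_{}'.format(i) for i in range(len(toy))]

# Passing an existing DataFrame with index=... reindexes: rows are looked up by label,
# and none of 'gene_0', 'gene_1', ... exist in toy, so every value becomes NaN
reindexed = pd.DataFrame(toy, index=labels)

# set_axis keeps the values and only replaces the row labels
relabelled = toy.set_axis(labels, axis=0)

print("reindexed is all NaN:", bool(reindexed.isna().all().all()))
print("relabelled has NaN:", bool(relabelled.isna().any().any()))
```

Checking `df.isna().sum().sum()` right after loading is a cheap way to notice this kind of silent data loss before NaN-dropping removes the evidence.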

In [3]:
# Remove missing values before the layers are combined
# Drop columns with any NaN values
data_rna = data_rna.dropna(axis=1)
data_methylation = data_methylation.dropna(axis=1)
data_protein = data_protein.dropna(axis=1)

In [4]:
# Combine multiomics data
data_combined = pd.concat([data_rna, data_methylation, data_protein], axis=1)

#### Explanation:
<br>
Data Preparation:
<br>
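The example combines three layers: RNA expression and methylation read from CSV files, and a protein layer simulated with random values. The layers are concatenated column-wise into a single matrix (data_combined) whose rows carry the labels gene_0 to gene_135. Note that scikit-learn factorizes rows against columns: 𝑊 gets one row per row of data_combined and 𝐻 gets one column per column of data_combined, so arrange your own data with the entities you want to profile (typically samples) as rows. Replace the sample data with your actual multiomics data.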

Normalization:<br>
NMF requires non-negative input, and omics layers are often measured on very different scales. MinMaxScaler rescales each column to the range [0, 1], which guarantees non-negativity and puts all columns on a comparable scale.
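
As a small worked example of the normalization step, the sketch below applies the min-max formula (x - min) / (max - min) by hand to each column and compares the result with `MinMaxScaler`. The matrix `X` is a toy example chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix with negative entries; rows are observations, columns are features
X = np.array([
    [-2.0, 10.0],
    [ 0.0, 20.0],
    [ 2.0, 40.0],
])

# Manual min-max scaling, column by column: (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
X_scaled = MinMaxScaler().fit_transform(X)

print(X_manual)
print("Matches MinMaxScaler:", bool(np.allclose(X_manual, X_scaled)))
```

Because the scaling is done per column, differences in overall magnitude between columns are removed; whether that is desirable depends on the data.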


In [5]:
# Data normalization
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data_combined)
data_normalized = pd.DataFrame(data_normalized, index=data_combined.index, columns=data_combined.columns)

Applying NMF:<br>
The NMF algorithm is applied with the specified number of components (in this case, 10). The model decomposes the normalized data matrix into two non-negative matrices:
+ W (sample-component matrix)
+ H (component-feature matrix)

In [6]:
# Apply NMF
n_components = 10  # Number of components to extract
nmf = NMF(n_components=n_components, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(data_normalized)
H = nmf.components_



W captures the association between samples and NMF components (latent factors), while H captures the contribution of each original feature to each NMF component.

In [7]:
# Create dataframes for W and H
W_df = pd.DataFrame(W, index=data_normalized.index, columns=['NMF_{}'.format(i) for i in range(n_components)])
H_df = pd.DataFrame(H, index=['NMF_{}'.format(i) for i in range(n_components)], columns=data_normalized.columns)


In [8]:
# Print results
print("W (Sample-component matrix):")
print(W_df.head())
print("\nH (Component-feature matrix):")
print(H_df.head())

W (Sample-component matrix):
           NMF_0     NMF_1     NMF_2     NMF_3     NMF_4     NMF_5     NMF_6  \
gene_0  0.018561  0.000000  0.071710  0.074608  0.030406  0.000000  0.073465   
gene_1  0.010157  0.001872  0.080459  0.103289  0.101455  0.000000  0.074791   
gene_2  0.009635  0.002250  0.086954  0.079768  0.015968  0.101549  0.062943   
gene_3  0.009095  0.001158  0.103489  0.000000  0.290887  0.138927  0.000000   
gene_4  0.005647  0.000853  0.149064  0.089072  0.128870  0.067688  0.136676   

           NMF_7     NMF_8     NMF_9  
gene_0  0.435539  0.234751  0.108511  
gene_1  0.186611  0.051346  0.042895  
gene_2  0.013761  0.000000  0.360593  
gene_3  0.146690  0.404938  0.123772  
gene_4  0.000000  0.170508  0.000000  

H (Component-feature matrix):
             0          1         2          3         4          5   \
NMF_0  9.080151   0.000000  0.000000   0.000000  0.000000   0.000000   
NMF_1  5.832292  32.436065  0.000000  22.605456  3.197621  55.064605   
NMF_2  1.

Results Interpretation:<br>
+ The W_df matrix shows how each sample relates to the underlying NMF components.
+ The H_df matrix shows the importance of each original feature (e.g., genes, metabolites, proteins) in constructing each NMF component.

These decomposed matrices (W and H) can be used for further analysis like clustering, identifying latent biological factors, or associating with phenotypic data such as disease status or treatment response.
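
One simple way to use 𝑊 for grouping is to assign each row to the component on which it has the largest weight. The sketch below does this on a toy stand-in for `W_df`; the values and the name `W_toy` are illustrative assumptions, and hard assignment is only one of several possible clustering strategies.

```python
import pandas as pd

# Toy stand-in for W_df: 5 rows (observations) x 3 NMF components; values are illustrative
W_toy = pd.DataFrame(
    [[0.9, 0.1, 0.0],
     [0.8, 0.0, 0.2],
     [0.1, 0.7, 0.1],
     [0.0, 0.2, 0.9],
     [0.1, 0.6, 0.3]],
    index=['row_{}'.format(i) for i in range(5)],
    columns=['NMF_{}'.format(i) for i in range(3)],
)

# Hard assignment: label each row with the component that has the largest weight
dominant = W_toy.idxmax(axis=1)

print(dominant)
# Number of rows assigned to each component
print(dominant.value_counts().sort_index())
```

The resulting group labels could then be cross-tabulated against phenotypic data such as disease status.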

#### Potential Use Cases for NMF in Multiomics Data:
+ Dimensionality Reduction: Simplify complex multiomics datasets by focusing on the most important components.
+ Pattern Discovery: Uncover hidden biological patterns across multiple omic layers, such as common pathways or gene-metabolite interactions.
+ Feature Extraction: Identify key features (genes, proteins, metabolites) that drive biological processes or disease mechanisms.

By using NMF, we can achieve a more integrative understanding of biological data, ultimately paving the way for improved disease classification, biomarker discovery, and drug response prediction.


## Using NMF for Biomarker Discovery

Building upon the previous section, this part will show how NMF can be applied to biomarker discovery. The audience will learn how to use NMF for identifying potential biomarkers that can be used for disease diagnosis, prognosis, or therapeutic targeting.

### Topics:

+ Biomarker discovery and its importance in precision medicine

+ Practical applications of NMF in biomarker identification

+ Hands-on example: Using NMF for identifying biomarkers in multiomics data



1. **What Are Biomarkers?**
    + Definition: Biomarkers are measurable indicators of a biological state or condition—such as specific genes, proteins, or metabolites—that can be used to detect or monitor diseases, predict treatment response, or classify disease subtypes.

    + Examples:

        + HER2 expression in breast cancer

        + PSA levels in prostate cancer

        + Blood glucose levels as a biomarker for diabetes


2. **The Role of Multiomics in Biomarker Discovery**
    + Multiomics data provide a comprehensive view of the molecular landscape, enabling the discovery of biomarkers not just at the gene level, but also at the transcript, protein, and metabolite levels.

    + By integrating these data layers, researchers can identify multi-modal biomarkers—combinations of omics features that jointly provide more predictive power than any single omic alone.


3. **Why Use NMF for Biomarker Discovery?**
    + Pattern Discovery: NMF excels at discovering co-expressed or co-regulated feature groups by breaking down complex data into additive parts, which can highlight biologically relevant patterns.

    + Interpretability: Because its factors are non-negative and additive, NMF tends to produce sparse, interpretable components, and each component can often be associated with a specific biological process or sample subtype.

    + Feature Selection: The learned components from NMF can be analyzed to select features (e.g., genes or proteins) that contribute most to distinguishing disease vs. healthy samples, or to specific subtypes of a condition.


4. **Step-by-Step: Applying NMF to Identify Biomarkers (this was done in our code example earlier)**
    + Preprocessing:
        + Normalize omics data (e.g., gene expression matrix)
        + Filter low-variance features to reduce noise

    + Apply NMF:
        + Decompose the data matrix into two non-negative matrices: 𝑊 (sample-component matrix) and 𝐻 (component-feature matrix)
        + Choose number of components (k), which might represent biological processes or patient subtypes

    + Interpret Components:
        + Analyze 𝐻: Identify top contributing genes/proteins/metabolites for each component
        + Analyze 𝑊: Cluster samples based on their component profiles

    + Biomarker Selection:
        + Extract top-ranked features (e.g., top genes per component) as candidate biomarkers
        + Validate them using external datasets or functional annotation databases
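
The low-variance filtering step listed under Preprocessing can be sketched as follows. This is a minimal example on synthetic data; the feature names and the threshold value are assumptions, and a suitable threshold depends on how the data were scaled.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Toy matrix: 20 observations x 4 features; 'flat' is constant and carries no signal
X = pd.DataFrame({
    'f_a': rng.random(20),
    'f_b': rng.random(20),
    'flat': np.full(20, 0.5),
    'f_c': rng.random(20),
})

# Keep only features whose variance exceeds the (assumed) threshold
selector = VarianceThreshold(threshold=1e-3)
selector.fit(X)
kept = X.columns[selector.get_support()]
X_filtered = X[kept]

print("Kept features:", list(kept))
```

Filtering before NMF reduces the number of columns the factorization has to explain, which can also shorten run time on large omics tables.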

### Demonstration:

In [10]:
# Assuming H_df from the first code is used for biomarker discovery
# H_df is the component-feature matrix (rows = components, columns = features)

# Function to identify top biomarkers for each NMF component
def discover_biomarkers(H_df, top_n=10):
    biomarkers = {}
    
    # Loop through each component and find the top features
    for component in H_df.index:
        # Sort the features by the weight (importance) in descending order
        sorted_features = H_df.loc[component].sort_values(ascending=False)
        
        # Select the top 'n' biomarkers for this component
        top_biomarkers = sorted_features.head(top_n)
        
        biomarkers[component] = top_biomarkers
        
    return biomarkers

# Discover top biomarkers for each component
top_biomarkers = discover_biomarkers(H_df, top_n=10)



Case Study: Applied to the omics data from cancer patients used above, the function returns the top-weighted features of each NMF component.

Biomarker Discovery: The function *discover_biomarkers* processes the H_df matrix, which has the weights of each feature (gene, protein, etc.) for each NMF component. By sorting the weights for each component, we can identify the features most associated with that component.

Top Biomarkers: For each NMF component, we select the top n features (genes, proteins) that have the highest weights. These are considered as the biomarkers for the corresponding component.

Output: The next cell prints the top biomarkers for each component. These are the features most strongly associated with each identified latent factor.

These features can then become candidate biomarkers for use in a prediction model.
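
For reporting or export, the per-component rankings can also be collected into a single long table. The sketch below is one possible way to do this on a toy stand-in for `H_df`; the names `H_toy`, `top_table`, and the feature labels are hypothetical.

```python
import pandas as pd

# Toy stand-in for H_df: 2 components x 5 features; values are illustrative
H_toy = pd.DataFrame(
    [[0.1, 2.0, 0.0, 1.5, 0.3],
     [3.0, 0.0, 0.2, 0.1, 2.5]],
    index=['NMF_0', 'NMF_1'],
    columns=['feat_a', 'feat_b', 'feat_c', 'feat_d', 'feat_e'],
)

top_n = 2  # assumed number of features to keep per component
records = []
for component, weights in H_toy.iterrows():
    # nlargest returns the top_n weights in descending order
    for rank, (feature, weight) in enumerate(weights.nlargest(top_n).items(), start=1):
        records.append({'component': component, 'rank': rank,
                        'feature': feature, 'weight': weight})

top_table = pd.DataFrame(records)
print(top_table)
```

A table in this shape is easy to save with `to_csv` or to join against annotation resources when the candidates are followed up.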

In [11]:
# Print the top biomarkers for each component
for component, biomarkers_list in top_biomarkers.items():
    print(f"Component {component} - Top Biomarkers:")
    print(biomarkers_list)
    print("\n")

Component NMF_0 - Top Biomarkers:
31    18.698089
25    17.675052
14    16.417682
20    15.605661
44    15.264663
34    13.821380
35    11.587315
10    11.498126
8     10.820830
43    10.783739
Name: NMF_0, dtype: float64


Component NMF_1 - Top Biomarkers:
40    79.179383
14    76.842831
22    76.084987
6     75.336558
13    68.157619
48    68.018119
34    64.137355
39    56.616922
26    55.756656
21    55.441853
Name: NMF_1, dtype: float64


Component NMF_2 - Top Biomarkers:
5     4.633690
4     2.367231
30    2.209117
32    2.055639
37    2.013619
20    1.987522
39    1.666948
33    1.659968
43    1.571705
23    1.521520
Name: NMF_2, dtype: float64


Component NMF_3 - Top Biomarkers:
16    6.320379
6     3.932294
44    3.557771
9     3.043414
12    2.958299
47    2.634916
17    2.389752
3     2.235459
45    2.159968
31    2.141935
Name: NMF_3, dtype: float64


Component NMF_4 - Top Biomarkers:
7     2.526752
38    2.225240
28    1.847374
45    1.729980
32    1.537031
46    1.448975


#### Applying Machine Learning for Multiomics Data Integration

Next, we will implement a predictive model based on our results, using a Random Forest Classifier.

The input features (X) are taken from the W_df matrix, the output of NMF that represents each sample's projection onto the NMF components.

For this demonstration we assume that y (the diagnosis labels) is available and already aligned with the rows of W_df. In real scenarios, the labels would be extracted from a corresponding label set and matched to the samples before training.
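
When real labels are used, they need to be in the same row order as the feature matrix. The sketch below shows one way to align them by row name and to fail early if a label is missing; `W_toy` and `labels` are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# Toy stand-in for W_df (3 rows) and a hypothetical label table keyed by the same row names
W_toy = pd.DataFrame(
    {'NMF_0': [0.2, 0.9, 0.4], 'NMF_1': [0.7, 0.1, 0.5]},
    index=['row_0', 'row_1', 'row_2'],
)
labels = pd.Series({'row_2': 1, 'row_0': 0, 'row_1': 1}, name='diagnosis')

# Reorder the labels to match the rows of the feature matrix
y_aligned = labels.reindex(W_toy.index)

# Fail early if any row has no label instead of training on misaligned data
if y_aligned.isna().any():
    raise ValueError('Some rows of the feature matrix have no label')

y_aligned = y_aligned.astype(int)
print(y_aligned)
```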

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

Data Preparation:<br>
W_df is used as the feature matrix (X), and placeholder labels are generated for the target variable (y).<br>
The data are then split into training (80%) and testing (20%) sets using train_test_split.

In [13]:
# Assuming W_df from the first code is used as X (features)
# Create some dummy labels for demonstration purposes (replace with actual labels)
# Ensure the length of labels matches the number of rows in W_df
y = pd.Series([1 if i % 2 == 0 else 0 for i in range(W_df.shape[0])], name='diagnosis')

# Prepare the data
X = W_df  # The matrix W_df (transformed features from NMF)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
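
With imbalanced classes, a plain random split can leave the test set with a class ratio that differs from the full data. As an optional variation on the split above, the sketch below uses the `stratify` argument of `train_test_split` on synthetic data; the class sizes and names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy features and imbalanced labels (30 of class 0, 10 of class 1); values are illustrative
X_toy = pd.DataFrame(rng.random((40, 3)), columns=['NMF_0', 'NMF_1', 'NMF_2'])
y_toy = pd.Series([0] * 30 + [1] * 10, name='diagnosis')

# stratify=y_toy keeps the class ratio the same in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```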

Model Training:

In [14]:
# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

A Random Forest Classifier is instantiated here; it is trained on the training data (X_train and y_train) in the next cell.

The n_estimators = 100 argument specifies the number of decision trees to train within the random forest.

In [15]:
model.fit(X_train, y_train)

Model Evaluation:<br>
The trained model is then used to predict the class labels for the test set (X_test), and the predictions are compared with the true labels (y_test).

In [16]:
# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

The accuracy of the model is computed using accuracy_score, and a more detailed classification report (precision, recall, F1-score) is generated using classification_report.

+ accuracy_score(y_test, y_pred): This calculates the overall accuracy of the model by comparing predicted values (y_pred) with actual values (y_test).
+ classification_report(y_test, y_pred): This provides a more detailed performance evaluation, showing precision, recall, F1-score, and support for each class.



The output includes:
1.	Accuracy: The overall accuracy of the classifier (i.e., the percentage of correct predictions).
2.	Classification Report: A detailed breakdown of the model's performance:
    + Precision: The proportion of true positives (correct disease predictions) among all predicted positives.
    + Recall: The proportion of true positives among all actual positives (diseased samples).
    + F1-score: The harmonic mean of precision and recall.
    + Support: The number of samples in each class.


In [17]:
print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Accuracy: 0.35714285714285715
Classification Report:
              precision    recall  f1-score   support

           0       0.26      0.56      0.36         9
           1       0.56      0.26      0.36        19

    accuracy                           0.36        28
   macro avg       0.41      0.41      0.36        28
weighted avg       0.46      0.36      0.36        28
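
To make the definitions concrete, the class-1 row and the accuracy of the report above can be recomputed by hand. The confusion-matrix counts below are an assumption: they are not printed in the notebook, but were chosen to be consistent with the reported supports (9 and 19) and scores.

```python
# Confusion-matrix counts (assumed; chosen to be consistent with the report above)
tn, fp = 5, 4    # true class 0: predicted 0, predicted 1
fn, tp = 14, 5   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the samples predicted as class 1, how many are class 1
recall = tp / (tp + fn)      # of the true class-1 samples, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```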



##### Limitations and Considerations:
Choosing the Right k: The number of components must be selected carefully—too few can underrepresent complexity, too many can overfit.
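
One common heuristic for choosing k is to fit NMF for several values and look at where the reconstruction error stops improving noticeably. The sketch below illustrates this on synthetic data built from three hidden components; the data, the range of k, and the use of reconstruction error as the only criterion are all simplifying assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Toy non-negative matrix built from 3 hidden components plus a little noise
W_true = rng.random((60, 3))
H_true = rng.random((3, 20))
V = W_true @ H_true + 0.01 * rng.random((60, 20))

# Fit NMF for several candidate k and record the reconstruction error
errors = {}
for k in range(1, 7):
    model = NMF(n_components=k, init='nndsvda', random_state=0, max_iter=1000)
    model.fit(V)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(k, round(float(err), 4))
```

On real data the curve is rarely this clean, so it is usually combined with other evidence such as the stability of components across random seeds or their biological interpretability.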

Noise Sensitivity: NMF can be sensitive to data quality; pre-processing steps are crucial.

Validation: Biomarkers identified computationally should be experimentally validated or confirmed in independent datasets.

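Note that the labels in this demonstration are dummy values, so the accuracy reported above carries no biological meaning. With real labels, performance may be improved through methods such as hyperparameter tuning, and feature importance scores can help identify the most influential features in predicting the disease.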
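
A minimal sketch of both ideas, hyperparameter tuning and feature importance, is given below. It runs on synthetic data from `make_classification` rather than on `W_df`, and the parameter grid is an illustrative assumption, not a tuned recommendation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (W_df, y): 120 rows, 10 "component" features
X_arr, y_arr = make_classification(
    n_samples=120, n_features=10, n_informative=4, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=['NMF_{}'.format(i) for i in range(10)])

# Small, illustrative grid; the values are assumptions
param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
)
search.fit(X_toy, y_arr)

# Rank the components by their importance in the best model
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)

print(search.best_params_)
print(importances.head())
```

Components that rank highly here can be traced back through 𝐻 to the original features that load on them, which links the prediction model to the candidate biomarkers.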