# ISCB-Africa ASBCB 2025

## Lagoon Beach Hotel & Conference Center, Cape Town, South Africa

**Date of the session:** April 10, 2025

**Instructor(s)/Affiliation(s):** Loni Taylor, Meharry Medical College, Nashville, TN, USA. Bishnu Sarker, Meharry Medical College, Nashville, TN, USA. Animesh Acharjee, University of Birmingham, UK.

## 3a: Implementing Non-negative Matrix Factorization (NMF)

### Overview:
This section will dive deeper into Non-negative Matrix Factorization (NMF), explaining how it can be used to decompose complex multiomics datasets into meaningful components. NMF’s role in identifying latent biological factors and reducing the complexity of data will be discussed in the context of multiomics.

#### Topics:

+ Principles of NMF and its applications in biological data
+ How NMF helps in dimensionality reduction and feature extraction
+ Sample demonstration using NMF in omics data analysis

### Non-negative Matrix Factorization (NMF):

NMF is a matrix factorization technique that helps uncover hidden patterns in complex datasets by decomposing data into non-negative components. NMF decomposes a non-negative matrix into two non-negative matrices—one representing the relationship between samples and the latent components, and the other representing the relationship between the latent components and the original features.

It is particularly useful in multiomics, where it can identify latent factors that represent biological processes shared across different layers of omic data. By applying NMF, we can reduce the dimensionality of multi-omics data, making it easier to analyze and visualize complex biological relationships. In the context of multiomics data, NMF can uncover shared patterns across different omics layers, reducing the complexity of the data and enabling further analysis such as clustering, classification, and feature extraction.



### Key Characteristics of NMF:

**1. Unsupervised Learning:** NMF is unsupervised because it doesn't require labeled data. It seeks to find hidden patterns or structures in the data based on the data's intrinsic properties.

**2. Factorization:** NMF works by factorizing a non-negative matrix 𝑉 into two smaller non-negative matrices 𝑊 and 𝐻, such that:

                        𝑉 ≈ 𝑊 × 𝐻

+ 𝑉 is the original data matrix (for example, gene expression values or document-term matrix).

+ 𝑊 contains the "basis" or "components," often interpreted as the features or patterns.

+ 𝐻 contains the "coefficients" or "activations," showing how strongly each component is present in the original data.

**3. Non-Negativity Constraint:** The key feature of NMF is that it enforces non-negative values in the factorization. This is useful in biological data or image processing, where negative values may not make sense or may not have an interpretable meaning.

**4. Dimensionality Reduction:** NMF is often used for reducing the dimensionality of the data while maintaining interpretability. The lower-dimensional representations (in 𝑊 and 𝐻) can reveal latent patterns or clusters in the data.


### Applications:

+ Gene Expression Data: NMF is commonly used to uncover hidden patterns in gene expression data, identifying gene signatures or latent biological processes.

+ Text Mining: In natural language processing, NMF can be applied to decompose document-term matrices to uncover topics (similar to Latent Dirichlet Allocation, LDA).

+ Image Processing: NMF is also used for image compression and feature extraction, where the matrix could represent pixel intensities.

+ Multiomics Integration: NMF can be applied to integrate and uncover common features in multiple omics layers (e.g., genomics, transcriptomics, and proteomics).

The following remainder of the notebook demonstrates how to implement NMF for integrating multiomics data, such as RNA expression, methylation, and protein expression data.
Download and follow this code notebook to see a simple process of integrating and analyzing multiomics data in python.


### **Applying Non-negative Matrix Factorization (NMF) for Multiomics Data Integration**


#### Implementing NMF for Multiomics Integration

We begin with our basic library imports. Pandas, numpy and scikit-learn for our data analysis, scientific computing, and machine learning needs.

#### Pandas

+ Purpose: Data manipulation and analysis.

+ Key Features:

    + Provides DataFrame and Series objects for handling data.

    + Easy handling of missing data.

    + Supports merging, grouping, and reshaping data.

    + Ideal for data cleaning, transformation, and exploration.

+ Common Use: Loading data from various sources (CSV, Excel, SQL), data wrangling (cleaning, reshaping, etc.), and exploratory data analysis (EDA).

#### NumPy
+ Purpose: Numerical computations and array manipulation.

+ Key Features:

    + Provides support for multi-dimensional arrays and matrices.

    + Fast array operations (vectorized operations).

    + Supports linear algebra, random sampling, and Fourier transforms.

+ Common Use: High-performance operations on large arrays or matrices, scientific computing, and numerical analysis.

#### Scikit-learn
+ Purpose: Machine learning.

+ Key Features:

    + A library for supervised and unsupervised learning.

    + Provides tools for model training, evaluation, and cross-validation.

    + Supports a wide range of algorithms, including classification, regression, clustering, and dimensionality reduction.

    + Includes utilities for preprocessing, such as scaling, encoding, and splitting data.

+ Common Use: Building and evaluating machine learning models, performing tasks like classification, regression, clustering, and feature selection.

In [1]:
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler


In [2]:
# Sample multiomics data (replace with your actual data/filepath)
# Each row represents a gene, and each column represents a sample
# Data should be non-negative

data = pd.read_csv('filepath/CLL_data_mRNA.csv')
data2 = pd.read_csv('filepath/CLL_data_Methylation.csv')
#data_rna = pd.DataFrame(np.random.rand(136, 50), index=['gene_{}'.format(i) for i in range(136)])
data_rna = pd.DataFrame(data, index=['gene_{}'.format(i) for i in range(136)])
#data_methylation = pd.DataFrame(np.random.rand(136, 50), index=['gene_{}'.format(i) for i in range(136)])
data_methylation = pd.DataFrame(data2, index=['gene_{}'.format(i) for i in range(136)])
data_protein = pd.DataFrame(np.random.rand(136, 50), index=['gene_{}'.format(i) for i in range(136)])
#data_protein = pd.DataFrame(data, index=['gene_{}'.format(i) for i in range(len(data.columns))])

In [3]:
###potential nans can be dropped before here:
# Drop columns with any NaN values
data_rna = data_rna.dropna(axis=1)
data_methylation = data_methylation.dropna(axis=1)
data_protein = data_protein.dropna(axis=1)

In [4]:
# Combine multiomics data
data_combined = pd.concat([data_rna, data_methylation, data_protein], axis=1)

#### Explanation:
<br>
Data Preparation:
<br>
The example data represents multi-omics data for RNA expression, methylation, and protein levels. These datasets are combined into a single matrix (data_combined), where rows represent genes and columns represent samples. Replace the sample data with your actual multiomics data.

Normalization:<br>
Since NMF requires non-negative values, it's crucial to scale the data. We use MinMaxScaler to normalize the data within the range [0, 1].


In [5]:
# Data normalization
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data_combined)
data_normalized = pd.DataFrame(data_normalized, index=data_combined.index, columns=data_combined.columns)

Applying NMF:<br>
The NMF algorithm is applied with the specified number of components (in this case, 10). The model decomposes the normalized data matrix into two non-negative matrices: 
+ W (sample-component matrix)<br> 
and
+ H (component-feature matrix).

In [6]:
# Apply NMF
n_components = 10  # Number of components to extract
nmf = NMF(n_components=n_components, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(data_normalized)
H = nmf.components_



W captures the association between samples and NMF components (latent factors), while H captures the contribution of each original feature to each NMF component.

In [7]:
# Create dataframes for W and H
W_df = pd.DataFrame(W, index=data_normalized.index, columns=['NMF_{}'.format(i) for i in range(n_components)])
H_df = pd.DataFrame(H, index=['NMF_{}'.format(i) for i in range(n_components)], columns=data_normalized.columns)


In [8]:
# Print results
print("W (Sample-component matrix):")
print(W_df.head())
print("\nH (Component-feature matrix):")
print(H_df.head())

W (Sample-component matrix):
           NMF_0     NMF_1     NMF_2     NMF_3     NMF_4     NMF_5     NMF_6  \
gene_0  0.018561  0.000000  0.071710  0.074608  0.030406  0.000000  0.073465   
gene_1  0.010157  0.001872  0.080459  0.103289  0.101455  0.000000  0.074791   
gene_2  0.009635  0.002250  0.086954  0.079768  0.015968  0.101549  0.062943   
gene_3  0.009095  0.001158  0.103489  0.000000  0.290887  0.138927  0.000000   
gene_4  0.005647  0.000853  0.149064  0.089072  0.128870  0.067688  0.136676   

           NMF_7     NMF_8     NMF_9  
gene_0  0.435539  0.234751  0.108511  
gene_1  0.186611  0.051346  0.042895  
gene_2  0.013761  0.000000  0.360593  
gene_3  0.146690  0.404938  0.123772  
gene_4  0.000000  0.170508  0.000000  

H (Component-feature matrix):
             0          1         2          3         4          5   \
NMF_0  9.080151   0.000000  0.000000   0.000000  0.000000   0.000000   
NMF_1  5.832292  32.436065  0.000000  22.605456  3.197621  55.064605   
NMF_2  1.

Results Interpretation:<br>
+ The W_df matrix shows how each sample relates to the underlying NMF components.
+ The H_df matrix shows the importance of each original feature (e.g., genes, metabolites, proteins) in constructing each NMF component.

These decomposed matrices (W and H) can be used for further analysis like clustering, identifying latent biological factors, or associating with phenotypic data such as disease status or treatment response.

#### Potential Use Cases for NMF in Multiomics Data:
+ Dimensionality Reduction: Simplify complex multiomics datasets by focusing on the most important components.
+ Pattern Discovery: Uncover hidden biological patterns across multiple omic layers, such as common pathways or gene-metabolite interactions.
+ Feature Extraction: Identify key features (genes, proteins, metabolites) that drive biological processes or disease mechanisms.

By using NMF, we can achieve a more integrative understanding of biological data, ultimately paving the way for improved disease classification, biomarker discovery, and drug response prediction.
