# 2003-Bioinformatics-A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias

The idea is to transform the data so that every sample (for example, each patient's array) has the same distribution of values.

"The goal of the Quantile method is to make the distribution of probe intensities for each array in a set of arrays the same.  The method is motivated by the idea that a quantile-quantile plot shows that the distribution of two data vectors is the same if the plot is a straight diagonal line and not the same if it is other than a diagonal line"

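The quantile-quantile idea in the quotation can be sketched in a few lines. This is only an illustration: the vectors `a` and `b` are small made-up intensity vectors, and for equal-length vectors the Q-Q points are taken to be the pairs of k-th smallest values.

```python
import numpy as np

# Illustrative intensity vectors for two arrays (assumed values, not from the paper).
a = np.array([13.0, 4.0, 25.0, 9.0])
b = np.array([7.0, 18.0, 15.0, 22.0])

# For equal-length vectors, a Q-Q plot pairs the k-th smallest value of one
# vector with the k-th smallest value of the other.
qq_points = np.column_stack([np.sort(a), np.sort(b)])
print(qq_points)

# The points lie on the diagonal only when the sorted values coincide,
# i.e. when the two vectors have the same empirical distribution.
on_diagonal = bool(np.allclose(qq_points[:, 0], qq_points[:, 1]))
print("Points on the diagonal:", on_diagonal)
```

Here the sorted values differ, so the points fall off the diagonal. Quantile normalization forces them onto it by giving every array the same sorted values.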
Reference:

[1] B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed, “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias,” Bioinformatics, vol. 19, no. 2, pp. 185–193, Jan. 2003, doi: 10.1093/bioinformatics/19.2.185.
  

In [38]:
import numpy as np
import pandas as pd

np.set_printoptions(precision=2)

# Step 1: Create example data (rows are genes, columns are samples)
# "1. Given n arrays of length p, 
# form X of dimension p × n where each array is a column"
df = pd.DataFrame({
    "Sample1": [13, 4, 25, 9],
    "Sample2": [7, 18, 15, 22],
    "Sample3": [20, 6, 11, 17]
}, index=["GeneA", "GeneB", "GeneC", "GeneD"])

print("Original data:")
print(df)

# Step 2: Sort each column (independently)
# "2. Sort each column of X to give Xsort"
sorted_matrix = np.sort(df.values, axis=0)

print("Sorted matrix per column:")
print(sorted_matrix)

# Step 3: Compute row-wise mean
# "3. Take the means across rows of Xsort and assign this 
# mean to each element in the row to get X'sort"
rank_means = np.mean(sorted_matrix, axis=1)

print("Row-wise means:")
print(rank_means)

# Step 4: Map the rank means back to each value's original position
# "4. Get $X_normalized$ by rearranging each column of X'_sort 
# to have the same ordering as original X"
df_normalized = pd.DataFrame(0.0, index=df.index, columns=df.columns)
for i in range(df.shape[1]): # Iterate over each sample (column)
    ranks = np.argsort(np.argsort(df.iloc[:, i].values)) # Get ranks
    df_normalized.iloc[:, i] = rank_means[ranks] # Assign mean to original ranks

print("Quantile normalized data:")
print(df_normalized.round(2))

Original data:
       Sample1  Sample2  Sample3
GeneA       13        7       20
GeneB        4       18        6
GeneC       25       15       11
GeneD        9       22       17
Sorted matrix per column:
[[ 4  7  6]
 [ 9 15 11]
 [13 18 17]
 [25 22 20]]
Row-wise means:
[ 5.67 11.67 16.   22.33]
Quantile normalized data:
       Sample1  Sample2  Sample3
GeneA    16.00     5.67    22.33
GeneB     5.67    16.00     5.67
GeneC    22.33    11.67    11.67
GeneD    11.67    22.33    16.00
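A quick sanity check on the result above: if the normalization worked, every column should contain exactly the same set of values, only in a different order. The sketch below copies the rounded output by hand instead of reusing `df_normalized`, so it runs on its own.

```python
import numpy as np

# Normalized values copied from the printed output above (rounded to 2 decimals).
normalized = np.array([
    [16.00, 5.67, 22.33],
    [5.67, 16.00, 5.67],
    [22.33, 11.67, 11.67],
    [11.67, 22.33, 16.00],
])

# Sort each column independently, then compare every column with the first one.
sorted_cols = np.sort(normalized, axis=0)
same_distribution = bool(np.all(sorted_cols == sorted_cols[:, [0]]))

print(sorted_cols)
print("All columns share the same values:", same_distribution)
```

Each sorted column equals the row-wise means from Step 3, which is what "the same distribution" means here.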


In [39]:
# Example how to get ranks
print("Example data:")
print(df.iloc[:, 0].values)

print("Indices that would sort the array:")
print(np.argsort(df.iloc[:, 0].values)) # Get the indices that would sort an array

print("Ranks of the original values:")
print(np.argsort(np.argsort(df.iloc[:, 0].values))) # Get the ranks of the original values

Example data:
[13  4 25  9]
Indices that would sort the array:
[1 3 0 2]
Ranks of the original values:
[2 0 3 1]
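One caveat of the double `argsort` trick is that tied values still receive distinct ranks, so equal inputs can end up with different normalized values. The following is a minimal sketch of one possible way to handle this, assuming it is acceptable to average the rank means within each group of ties. The helper `assign_with_ties` and its inputs are hypothetical and are not part of the code in this note.

```python
import numpy as np

def assign_with_ties(values, rank_means):
    """Map rank means onto values, averaging the means within each group of ties."""
    values = np.asarray(values, dtype=float)
    rank_means = np.asarray(rank_means, dtype=float)
    # A stable sort ranks tied values in order of appearance.
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    assigned = rank_means[ranks]  # fancy indexing returns a copy
    for v in np.unique(values):
        tied = values == v
        assigned[tied] = assigned[tied].mean()
    return assigned

column = [5, 5, 1]            # the two 5s are tied
rank_means = [1.0, 2.0, 4.0]  # assumed rank means for illustration

plain_ranks = np.argsort(np.argsort(column, kind="stable"), kind="stable")
print(plain_ranks)            # the tied 5s get ranks 1 and 2

tie_result = assign_with_ties(column, rank_means)
print(tie_result)             # both 5s receive the average of 2.0 and 4.0
```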


In [45]:
def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Quantile-normalize the columns (samples) of df; ties are not treated specially."""
    sorted_matrix = np.sort(df.values, axis=0) # Sort each column independently
    rank_means = np.mean(sorted_matrix, axis=1) # Mean of each row of the sorted matrix
    df_normalized = pd.DataFrame(0.0, index=df.index, columns=df.columns)
    for i in range(df.shape[1]): # Iterate over each sample (column)
        ranks = np.argsort(np.argsort(df.iloc[:, i].values)) # Get ranks
        df_normalized.iloc[:, i] = rank_means[ranks] # Assign mean to original ranks
    return df_normalized

df_normalized = quantile_normalize(df)

print("Original data:")
print(df)

print("Quantile normalized data:")
print(df_normalized.round(2))

Original data:
       Sample1  Sample2  Sample3
GeneA       13        7       20
GeneB        4       18        6
GeneC       25       15       11
GeneD        9       22       17
Quantile normalized data:
       Sample1  Sample2  Sample3
GeneA    16.00     5.67    22.33
GeneB     5.67    16.00     5.67
GeneC    22.33    11.67    11.67
GeneD    11.67    22.33    16.00
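The per-column loop can also be written without an explicit loop. The sketch below is one possible NumPy-only variant, assuming a plain 2-D array with genes in rows and samples in columns. The function name `quantile_normalize_array` is made up for this example, and, like the loop version, it does not treat ties specially.

```python
import numpy as np

def quantile_normalize_array(x):
    """Quantile-normalize the columns of a 2-D array (rows: genes, columns: samples)."""
    x = np.asarray(x, dtype=float)
    # Mean of each row after sorting every column independently.
    rank_means = np.sort(x, axis=0).mean(axis=1)
    # Rank of each value within its column (0 = smallest).
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Indexing with the full rank matrix replaces every value by its rank mean.
    return rank_means[ranks]

# Same values as the example DataFrame, as a bare array.
x = np.array([
    [13, 7, 20],
    [4, 18, 6],
    [25, 15, 11],
    [9, 22, 17],
])
result = quantile_normalize_array(x)
print(result.round(2))
```

For this example the result agrees with the table printed above.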
