
# Transcriptomic Descriptor Analysis with Multilag Autocorrelation

This notebook provides a comprehensive analysis of transcriptomic data, including the calculation of various mathematical descriptors and multilag autocorrelation values for each gene.

## Objectives:
- Compute advanced mathematical descriptors for gene expression data.
- Evaluate multilag autocorrelation to understand stability and patterns in gene expression.


In [1]:

# Import required libraries
import pandas as pd
import numpy as np
from scipy.stats import skew, kurtosis

# Load the dataset
file_path = '/mnt/data/GSE136116_raw.counts (3).csv'  # Adjust the path if necessary
data = pd.read_csv(file_path)

# Extract numeric columns (samples data)
expression_data = data.iloc[:, 1:]

# Display the first few rows of the dataset
data.head()


Unnamed: 0,Gene_name,R1_T36_1,R1_T36_2,R1_T36_3,R1_Ctl_1,R1_Ctl_2,R1_Ctl_3
0,APR1,2720,4142,6205,13641,14974,12729
1,SEP7,1175,689,768,189,183,388
2,AAF1,719,677,455,263,301,293
3,AAH1,148,418,290,150,166,588
4,AAP1,75,60,103,99,139,78


## Calculation of Mathematical Descriptors

In [2]:

# Define functions for calculating descriptors
def gini_index(values):
    sorted_values = np.sort(values)
    n = len(values)
    cumulative = np.cumsum(sorted_values) / np.sum(sorted_values)
    return 1 - 2 * np.sum((n - np.arange(1, n + 1)) * cumulative) / n

def shannon_entropy(values):
    probabilities = values / np.sum(values)
    return -np.sum(probabilities * np.log(probabilities + 1e-10))  # Add small constant to avoid log(0)

def autocorrelation(values, lag=1):
    n = len(values)
    mean = np.mean(values)
    numerator = np.sum((values[:n-lag] - mean) * (values[lag:] - mean))
    denominator = np.sum((values - mean) ** 2)
    return numerator / denominator if denominator != 0 else np.nan

# Calculate metrics
results = pd.DataFrame({
    'Gene': data['Gene_name'],
    'Mean': expression_data.mean(axis=1),
    'SD': expression_data.std(axis=1),
    'CV': expression_data.std(axis=1) / expression_data.mean(axis=1),
    'Skewness': expression_data.apply(skew, axis=1),
    'Kurtosis': expression_data.apply(kurtosis, axis=1),
    'Range': expression_data.max(axis=1) - expression_data.min(axis=1),
    'Gini_Index': expression_data.apply(gini_index, axis=1),
    'Shannon_Entropy': expression_data.apply(shannon_entropy, axis=1),
    'Signal_to_Noise_Ratio': expression_data.mean(axis=1) / expression_data.std(axis=1),
    'Peakness_Index': expression_data.max(axis=1) / expression_data.mean(axis=1)
})


  cumulative = np.cumsum(sorted_values) / np.sum(sorted_values)


## Multilag Autocorrelation Analysis

In [3]:

# Define function for multilag autocorrelation
def autocorrelation_multiple_lags(values, max_lag=5):
    autocorrs = {}
    for lag in range(1, max_lag + 1):
        n = len(values)
        mean = np.mean(values)
        numerator = np.sum((values[:n-lag] - mean) * (values[lag:] - mean))
        denominator = np.sum((values - mean) ** 2)
        autocorrs[f'Autocorrelation_Lag_{lag}'] = numerator / denominator if denominator != 0 else np.nan
    return autocorrs

# Calculate autocorrelation for lags 1 through 5
autocorr_df = expression_data.apply(lambda x: autocorrelation_multiple_lags(x.values), axis=1)
autocorr_df = pd.DataFrame(autocorr_df.tolist(), index=expression_data.index)

# Add autocorrelation results to the main results table
results = pd.concat([results, autocorr_df], axis=1)

# Display the updated dataset
results.head()


Unnamed: 0,Gene,Mean,SD,CV,Skewness,Kurtosis,Range,Gini_Index,Shannon_Entropy,Signal_to_Noise_Ratio,Peakness_Index,Autocorrelation_Lag_1,Autocorrelation_Lag_2,Autocorrelation_Lag_3,Autocorrelation_Lag_4,Autocorrelation_Lag_5
0,APR1,9068.5,5328.328697,0.587565,-0.069968,-1.775191,12254,-0.049285,1.635186,1.701941,1.65121,0.569964,-0.031843,-0.483277,-0.391141,-0.163704
1,SEP7,565.333333,386.604018,0.683851,0.462802,-1.039394,992,0.037146,1.593736,1.462306,2.07842,0.315629,0.088677,-0.418378,-0.341257,-0.144671
2,AAF1,451.333333,202.841482,0.449427,0.436358,-1.56475,456,-0.319916,1.70962,2.225054,1.593058,0.547609,-0.05955,-0.41277,-0.369281,-0.206008
3,AAH1,293.333333,179.031468,0.610335,0.742366,-0.9181,440,-0.137121,1.645578,1.638446,2.004545,-0.232906,-0.369369,0.024801,0.344693,-0.267219
4,AAP1,92.333333,27.883089,0.301983,0.618357,-0.62441,79,-0.515042,1.754924,3.311446,1.505415,-0.018293,0.000457,-0.457211,-0.088864,0.063911



# Conclusions

- The calculated descriptors provide insights into the variability, symmetry, and diversity of gene expression.
- Multilag autocorrelation reveals stability and trends in expression values across samples.
- These analyses can guide further investigation into gene regulatory mechanisms.

The full table of results includes all descriptors and autocorrelations for lags 1 through 5.
