
# Transcriptomic Descriptor Analysis with Multilag Autocorrelation

This notebook computes a set of per-gene descriptors from a raw count matrix (mean, dispersion, distribution shape, inequality, and entropy measures) and the autocorrelation of each gene's expression at lags 1 through 5.

## Objectives:
- Compute advanced mathematical descriptors for gene expression data.
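- Compute autocorrelation at lags 1 through 5 along the sample columns to look for stability and patterns in gene expression. These values are only interpretable when the columns follow a meaningful order, such as a time course.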


In [None]:

# Import required libraries
import pandas as pd
import numpy as np
from scipy.stats import skew, kurtosis

# Load the dataset
file_path = '/mnt/data/GSE136116_raw.counts (3).csv'  # Adjust the path if necessary
data = pd.read_csv(file_path)

# Keep the sample columns (assumes the first column holds the gene names and all others are counts)
expression_data = data.iloc[:, 1:]

# Display the first few rows of the dataset
data.head()
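
Taking every column after the first assumes the file holds one identifier column followed only by sample counts. A minimal sketch of a more defensive selection, using a small made-up table in place of the real file (the column names `Biotype`, `Sample_A`, and `Sample_B` are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the counts table; annotation and sample column names are hypothetical.
toy = pd.DataFrame({
    "Gene_name": ["g1", "g2", "g3"],
    "Biotype": ["protein_coding", "lncRNA", "protein_coding"],
    "Sample_A": [10, 0, 5],
    "Sample_B": [12, 0, 7],
})

# Keep only numeric columns so annotation text never reaches the descriptor functions.
toy_expression = toy.select_dtypes(include="number")

# Flag genes with zero counts in every sample; ratio-based descriptors are undefined for them.
all_zero = toy_expression.sum(axis=1) == 0

print(toy_expression.columns.tolist())
print(toy.loc[all_zero, "Gene_name"].tolist())
```

Whether all-zero genes should be dropped or kept as `NaN` rows is a design choice that depends on the downstream analysis.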


## Calculation of Mathematical Descriptors

In [None]:

# Define functions for calculating descriptors
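def gini_index(values):
    # Gini coefficient from sorted cumulative shares: 0 = perfectly even, (n-1)/n = all counts in one sample
    sorted_values = np.sort(values)
    n = len(values)
    total = np.sum(sorted_values)
    if total == 0:
        return np.nan  # undefined for genes with no counts
    cumulative = np.cumsum(sorted_values) / total
    return (n + 1 - 2 * np.sum(cumulative)) / n

def shannon_entropy(values):
    # Entropy (natural log) of the gene's counts normalized to proportions across samples
    total = np.sum(values)
    if total == 0:
        return np.nan  # undefined for genes with no counts
    probabilities = values / total
    return -np.sum(probabilities * np.log(probabilities + 1e-10))  # Add small constant to avoid log(0)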

def autocorrelation(values, lag=1):
    n = len(values)
    mean = np.mean(values)
    numerator = np.sum((values[:n-lag] - mean) * (values[lag:] - mean))
    denominator = np.sum((values - mean) ** 2)
    return numerator / denominator if denominator != 0 else np.nan

# Calculate metrics row-wise (one row per gene)
gene_mean = expression_data.mean(axis=1)
gene_sd = expression_data.std(axis=1)  # pandas default: sample standard deviation (ddof=1)

results = pd.DataFrame({
    'Gene': data['Gene_name'],
    'Mean': gene_mean,
    'SD': gene_sd,
    'CV': gene_sd / gene_mean,  # inf or NaN when the mean is 0
    'Skewness': expression_data.apply(skew, axis=1),
    'Kurtosis': expression_data.apply(kurtosis, axis=1),  # scipy default: excess (Fisher) kurtosis
    'Range': expression_data.max(axis=1) - expression_data.min(axis=1),
    'Gini_Index': expression_data.apply(gini_index, axis=1),
    'Shannon_Entropy': expression_data.apply(shannon_entropy, axis=1),
    'Signal_to_Noise_Ratio': gene_mean / gene_sd,  # inf or NaN when the SD is 0
    'Peakness_Index': expression_data.max(axis=1) / gene_mean
})
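
Two boundary cases make useful sanity checks for the inequality and diversity descriptors. A perfectly even vector should give a Gini coefficient of 0 and an entropy of log(n). A vector with all counts in one sample should give a Gini coefficient of (n-1)/n and an entropy of 0. The sketch below uses the standard sorted-cumulative-share form of the Gini coefficient. The function names `gini_coefficient` and `entropy_nats` are illustrative only.

```python
import numpy as np

def gini_coefficient(values):
    """Gini coefficient via sorted cumulative shares; NaN if the total is zero."""
    x = np.sort(np.asarray(values, dtype=float))
    total = x.sum()
    if total == 0:
        return np.nan
    n = x.size
    cumulative_share = np.cumsum(x) / total
    return (n + 1 - 2 * cumulative_share.sum()) / n

def entropy_nats(values):
    """Shannon entropy (natural log) of values normalized to proportions.

    Zero entries are skipped, following the convention 0 * log(0) = 0.
    """
    x = np.asarray(values, dtype=float)
    total = x.sum()
    if total == 0:
        return np.nan
    p = x[x > 0] / total
    return -np.sum(p * np.log(p))

uniform = [5, 5, 5, 5]        # same count in every sample
concentrated = [0, 0, 0, 20]  # all counts in a single sample

print(gini_coefficient(uniform), entropy_nats(uniform))
print(gini_coefficient(concentrated), entropy_nats(concentrated))
```

Running the same two vectors through any alternative implementation is a quick way to confirm it matches these boundary values.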


## Multilag Autocorrelation Analysis

In [None]:

# Define function for multilag autocorrelation
def autocorrelation_multiple_lags(values, max_lag=5):
    # The mean and total sum of squares do not depend on the lag, so compute them once
    n = len(values)
    mean = np.mean(values)
    denominator = np.sum((values - mean) ** 2)
    autocorrs = {}
    for lag in range(1, max_lag + 1):
        numerator = np.sum((values[:n-lag] - mean) * (values[lag:] - mean))
        autocorrs[f'Autocorrelation_Lag_{lag}'] = numerator / denominator if denominator != 0 else np.nan
    return autocorrs

# Calculate autocorrelation for lags 1 through 5
autocorr_df = expression_data.apply(lambda x: autocorrelation_multiple_lags(x.values), axis=1)
autocorr_df = pd.DataFrame(autocorr_df.tolist(), index=expression_data.index)

# Add autocorrelation results to the main results table
results = pd.concat([results, autocorr_df], axis=1)

# Display the updated dataset
results.head()
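
Because the autocorrelation compares each sample with the one `lag` columns later, its value depends on the column order and not only on the expression values themselves. A minimal sketch on made-up vectors illustrates this, using the same normalization as the cell above (the name `lag_autocorrelation` is illustrative):

```python
import numpy as np

def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation at lag >= 1, normalized by the total sum of squares."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    denominator = np.sum(centered ** 2)
    if denominator == 0 or lag >= x.size:
        return np.nan
    return np.sum(centered[:-lag] * centered[lag:]) / denominator

trend = np.arange(1, 9)              # steadily increasing expression
alternating = np.array([1, 9] * 4)   # neighbouring samples flip between low and high
reordered = np.sort(alternating)     # same values, columns rearranged

print(lag_autocorrelation(trend, lag=1))
print(lag_autocorrelation(alternating, lag=1))
print(lag_autocorrelation(reordered, lag=1))
```

The trend gives a positive lag-1 value and the alternating pattern a negative one. Sorting the alternating values turns the result positive even though the set of values is unchanged, which is why the sample order matters when interpreting these columns.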



## Conclusions

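- The descriptors summarize each gene's variability (SD, CV, range), distribution shape (skewness, kurtosis), and evenness across samples (Gini index, Shannon entropy).
- Multilag autocorrelation measures how similar a gene's expression is between samples that are 1 to 5 columns apart. It indicates stability or trends only when the sample columns are in a meaningful order.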
- These analyses can guide further investigation into gene regulatory mechanisms.

The full table of results includes all descriptors and autocorrelations for lags 1 through 5.
