### WIP Methylation Statistical Analysis

This code analyzes a multi-tissue DNA methylation dataset from mice, sourced from ['Multi-tissue DNA methylation age predictor in mouse'](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1203-5). It calculates average methylation rate, depth, and identifies tissue-specific and age-related differentiations in the samples. By employing  statistical techniques, this code will provide valuable insights into DNA methylation patterns and their association with aging processes. 

In [19]:
import gzip
import os
import pandas as pd
import sys
from typing import Tuple, List

### Load Data & Build Dataframe

The build_df function generates a pandas DataFrame from a specified sample name. It initializes an empty DataFrame and defines column names. It iterates through files in the directory, checking for gzip format and matching sample name. Data is read using read_csv, with tab as separator, and stored in a temporary DataFrame. If the main DataFrame is empty, the data is directly assigned. The function returns the compiled DataFrame from matching files.

In [20]:
directory = 'data/GSE93957_RAW/'
sample_list = ["Lung", "Heart", "Liver", "Cortex"]

def build_df(test) -> Tuple[pd.DataFrame, str]:
    # Create empty DataFrame to be filled
    df = pd.DataFrame()
    columns_names = ['chromosome', 's_loc', 'e_loc', 'methyl_rate', 's_depth', 'e_depth']

    # Iterate over files in the directory
    for filename in os.listdir(directory):
        if not test:
        # Check if the file is gzipped 
            if filename.endswith('.gz') and '_1wk_' in filename:  
                file_path = os.path.join(directory, filename)

                # Open the gzipped file in text mode
                with gzip.open(file_path, 'rt') as file:  
                    print(file_path)
                    # if len(df) > 0:                    
                    #     # Create temporary dataframe to concatenate results
                    #     temp_df = pd.read_csv(file_path, sep="\t", header=None, names=columns_names, low_memory=False)
                    #     pd.concat([df, temp_df], ignore_index=True)
                    # else:
                    df = pd.read_csv(file_path, sep="\t", header=None, names=columns_names, low_memory=False)
        else:
            df = pd.read_csv('data/GSE93957_RAW/sample_cortex.txt', sep="\t", header=None, names=columns_names, low_memory=False)
    return df, filename

### Build Final Result

In [24]:
def build_result(test = False):
    column_names = ['sites', 'ave depth', 'ave methylation', 'ave methylation > 2 depth']
    result = pd.DataFrame(columns=column_names)

    
    df, name = build_df(test = True)
    length = len(df)
    depth = (df['s_depth'] + df['e_depth']).mean()
    mean = df['methyl_rate'].mean()
    
    # Get the average methylation for rows with depth >= 2
    df = df.drop(df[df['e_depth'] - df['s_depth'] < 2].index)
    two_depth_mean = df['methyl_rate'].mean()
    result.loc[len(result)] = [length, depth, mean, two_depth_mean]

    return result, name



### Display menu for user

In this example we choose '5': all samples

In [25]:
# Build a DataFrame from the raw files in 'Data' directory
df, row_name = build_df(True)

# Perform statistical analysis and build a 'result DataFrame' that will be saved
df = build_result()
df['Row'] = row_name
column_order = ['Row', 'sites', 'ave depth', 'ave methylation', 'ave methylation > 2 depth']
df = df[column_order]
df.to_csv('result.csv', index=False)

In [23]:
# Display Results Table
df

Unnamed: 0,Row,sites,ave depth,ave methylation,ave methylation > 2
0,sample_cortex.txt,50.0,12.54,50.67676,6.689342
