### WIP Methylation Statistical Analysis

This code analyzes a multi-tissue DNA methylation dataset from mice, sourced from ['Multi-tissue DNA methylation age predictor in mouse'](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1203-5). It calculates average methylation rate, depth, and identifies tissue-specific and age-related differentiations in the samples. By employing  statistical techniques, this code will provide valuable insights into DNA methylation patterns and their association with aging processes. 

In [55]:
import gzip
import os
import pandas as pd
import sys
from typing import Tuple, List

In [56]:
directory = 'data/GSE93957_RAW/'

### Load Data & Build Dataframe

The build_df function generates a pandas DataFrame from a specified sample name. It initializes an empty DataFrame and defines column names. It iterates through files in the directory, checking for gzip format and matching sample name. Data is read using read_csv, with tab as separator, and stored in a temporary DataFrame. If the main DataFrame is empty, the data is directly assigned. The function returns the compiled DataFrame from matching files.

In [57]:

sample_list = ["Lung", "Heart", "Liver", "Cortex"]

def build_df(filename) -> Tuple[pd.DataFrame, str]:
    # Create empty DataFrame to be filled
    df = pd.DataFrame()
    columns_names = ['chromosome', 's_loc', 'e_loc', 'methyl_rate', 's_depth', 'e_depth']
    
    # # Choose preset file to perform statistical analysis on
    # if test:
    #     filename = "GSM2465667_M04NB_1wk_Liver.cov.txt.gz"
    #     columns_names = ['chromosome', 's_loc', 'e_loc', 'methyl_rate', 's_depth', 'e_depth']
    #     with gzip.open('data/GSE93957_RAW/GSM2465667_M04NB_1wk_Liver.cov.txt.gz', 'rt') as file: 
    #         df = pd.read_csv('data/GSE93957_RAW/GSM2465667_M04NB_1wk_Liver.cov.txt.gz', sep="\t", header=None, names=columns_names, low_memory=False)
    #     return df, filename


    file_path = os.path.join(directory, filename)
    # Open the gzipped file in text mode
    with gzip.open(file_path, 'rt') as file:  
        print(file_path)
        df = pd.read_csv(file_path, sep="\t", header=None, names=columns_names, low_memory=False)
      
    return df, filename

### Build Final Result

In [60]:
def build_result(test = False):
    column_names = ['sites', 'ave depth', 'ave methylation', 'ave methylation > 2 depth', 'ave methylation > 5 depth']
    result = pd.DataFrame(columns=column_names)
    
    
    if not test:
        for filename in os.listdir(directory):
            df, name = build_df(filename)
            length = len(df)
            depth = (df['s_depth'] + df['e_depth']).mean()
            mean = df['methyl_rate'].mean()
            
            # Get the average methylation for rows with depth >= 2
            df = df.drop(df[df['e_depth'] + df['s_depth'] < 2].index)
            df = df.drop(df[df['e_depth'] + df['s_depth'] > 100].index)
            two_depth_mean = df['methyl_rate'].mean()

            df = df.drop(df[df['e_depth'] + df['s_depth'] < 5].index)
            five_depth_mean = df['methyl_rate'].mean()
            result.loc[len(result)] = [length, depth, mean, two_depth_mean, five_depth_mean]
    else: 
        build_df(filename="GSM2465667_M04NB_1wk_Liver.cov.txt.gz")
        length = len(df)
        depth = (df['s_depth'] + df['e_depth']).mean()
        mean = df['methyl_rate'].mean()
        
        # Get the average methylation for rows with depth >= 2
        df = df.drop(df[df['e_depth'] + df['s_depth'] < 2].index)
        df = df.drop(df[df['e_depth'] + df['s_depth'] > 100].index)
        two_depth_mean = df['methyl_rate'].mean()

        df = df.drop(df[df['e_depth'] + df['s_depth'] < 5].index)
        five_depth_mean = df['methyl_rate'].mean()
        result.loc[len(result)] = [length, depth, mean, two_depth_mean, five_depth_mean]
        

    return result, name



### Display menu for user

In this example we choose '5': all samples

In [61]:
# Perform statistical analysis and build a 'result DataFrame' that will be saved
df, row_name = build_result()
df['Row'] = row_name
column_order = ['Row', 'sites', 'ave depth', 'ave methylation', 'ave methylation > 2 depth', 'ave methylation > 5 depth']
df = df[column_order]
df.to_csv('result.csv', index=False)

data/GSE93957_RAW/GSM2465633_M00018724_27wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465619_M00018362_41wk_Cortex.cov.txt.gz
data/GSE93957_RAW/GSM2465667_M04NB_1wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465627_M00018381_41wk_Cortex.cov.txt.gz
data/GSE93957_RAW/GSM2465631_M00018724_27wk_Cortex.cov.txt.gz
data/GSE93957_RAW/GSM2465653_M02NB_1wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465636_M00018752_27wk_Heart.cov.txt.gz
data/GSE93957_RAW/GSM2465632_M00018724_27wk_Heart.cov.txt.gz
data/GSE93957_RAW/GSM2465637_M00018752_27wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465625_M00018363_41wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465668_M04NB_1wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465662_M0420527_14wk_Heart.cov.txt.gz
data/GSE93957_RAW/GSM2465665_M04NB_1wk_Cortex.cov.txt.gz
data/GSE93957_RAW/GSM2465642_M00018754_27wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465676_M0520522_14wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465621_M00018362_41wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465675_M0520522_14wk_Liv

In [65]:
# Display Results Table
df

Unnamed: 0,Row,sites,ave depth,ave methylation,ave methylation > 2 depth,ave methylation > 5 depth
0,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,4748562.0,11.236037,42.379657,36.123860,36.826557
1,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,4121662.0,14.977096,38.715772,37.048571,38.045658
2,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,4147692.0,13.342087,36.218352,34.165614,35.222294
3,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,4165468.0,15.123178,37.681559,36.102748,37.034388
4,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,4086565.0,15.372909,38.633424,37.273794,39.312597
...,...,...,...,...,...,...
57,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,3917926.0,12.731100,35.657432,34.545901,35.690143
58,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,3939791.0,9.395831,36.255070,34.480659,35.639115
59,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,3669961.0,7.935134,34.789121,34.387766,36.722110
60,GSM2465655_M03NB_1wk_Heart.cov.txt.gz,4257344.0,19.341303,34.964655,32.868154,33.876040


In [64]:
df['ave methylation > 5 depth'].mean()

36.96097574055515