### WIP Methylation Statistical Analysis

This code analyzes a multi-tissue DNA methylation dataset from mice, sourced from ['Multi-tissue DNA methylation age predictor in mouse'](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1203-5). It calculates average methylation rate, depth, and identifies tissue-specific and age-related differentiations in the samples. By employing  statistical techniques, this code will provide valuable insights into DNA methylation patterns and their association with aging processes. 

In [20]:
import gzip
import os
import pandas as pd
import sys

### Load Data & Build Dataframe

The build_df function generates a pandas DataFrame from a specified sample name. It initializes an empty DataFrame and defines column names. It iterates through files in the directory, checking for gzip format and matching sample name. Data is read using read_csv, with tab as separator, and stored in a temporary DataFrame. If the main DataFrame is empty, the data is directly assigned. The function returns the compiled DataFrame from matching files.

In [21]:
directory = 'data/GSE93957_RAW/'
sample_list = ["Lung", "Heart", "Liver", "Cortex"]

def build_df(sample: str) -> pd.DataFrame:
    # Create empty DataFrame to be filled
    df = pd.DataFrame()
    columns_names = ['chromosome', 's_loc', 'e_loc', 'methyl_rate', 's_depth', 'e_depth']

    # Iterate over files in the directory
    for filename in os.listdir(directory):

        # Check if the file is gzipped and the sample matches what we're working on
        if filename.endswith('.gz') and sample in filename and '_1wk_' in filename:  
            file_path = os.path.join(directory, filename)

            # Open the gzipped file in text mode
            with gzip.open(file_path, 'rt') as file:  
                print(file_path)
                if len(df) > 0:                    
                    # Create temporary dataframe to concatenate results
                    temp_df = pd.read_csv(file_path, sep="\t", header=None, names=columns_names, low_memory=False)
                    pd.concat([df, temp_df], ignore_index=True)
                else:
                    df = pd.read_csv(file_path, sep="\t", header=None, names=columns_names, low_memory=False)
    return df

### Build Final Result

In [22]:
def build_result():
    column_names = ['sites', 'ave depth', 'ave methylation']
    result = pd.DataFrame(columns=column_names)

    for sample in sample_list:
        df = build_df(sample)
        length = len(df)
        depth = (df['s_depth'] + df['e_depth']).mean()
        mean = df['methyl_rate'].mean()
        result.loc[len(result)] = [length, depth, mean]

    return result



### Display menu for user

In this example we choose '5': all samples

In [23]:
print("Pick the number of which tissue sample you'd like to perform statistical analysis on: ")
print("""
    1. Lung
    2. Heart
    3. Liver
    4. Cortex
    5. Save all Samples""")

opt_sel = int(input("--->"))

if opt_sel < 5:
    df = build_df(sample_list[opt_sel-1])
    # chromo_df = df[df['chromosome'] == '10']
    # print(chromo_df.head())
    print(f"Total # of Samples: {len(df)}")
    print(f"Average Methylation Rate: {df['methyl_rate'].mean()}%")
    print(f"Average Methylation depth: {(df['s_depth'] + df['e_depth']).mean()}")
else:
    # Analyze all the tissue samples and save to .CSV
    df = build_result()
    df['Row'] = sample_list
    column_order = ['Row', 'sites', 'ave depth', 'ave methylation']
    df = df[column_order]
    df.to_csv('result.csv', index=False)

Pick the number of which tissue sample you'd like to perform statistical analysis on: 

    1. Lung
    2. Heart
    3. Liver
    4. Cortex
    5. Save all Samples
data/GSE93957_RAW/GSM2465653_M02NB_1wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465668_M04NB_1wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465656_M03NB_1wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465650_M01NB_1wk_Lung.cov.txt.gz
data/GSE93957_RAW/GSM2465648_M01NB_1wk_Heart.cov.txt.gz
data/GSE93957_RAW/GSM2465666_M04NB_1wk_Heart.cov.txt.gz
data/GSE93957_RAW/GSM2465655_M03NB_1wk_Heart.cov.txt.gz
data/GSE93957_RAW/GSM2465667_M04NB_1wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465652_M02NB_1wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465649_M01NB_1wk_Liver.cov.txt.gz
data/GSE93957_RAW/GSM2465665_M04NB_1wk_Cortex.cov.txt.gz
data/GSE93957_RAW/GSM2465654_M03NB_1wk_Cortex.cov.txt.gz
data/GSE93957_RAW/GSM2465651_M02NB_1wk_Cortex.cov.txt.gz
data/GSE93957_RAW/GSM2465647_M01NB_1wk_Cortex.cov.txt.gz


In [24]:
# Display Results Table
df

Unnamed: 0,Row,sites,ave depth,ave methylation
0,Lung,4085551.0,15.241029,35.427501
1,Heart,4021292.0,14.609889,35.120765
2,Liver,4147692.0,13.342087,36.218352
3,Cortex,4353634.0,14.978642,38.081742
