# Computational Genetics: Statistics, Supervised and Unsupervised Learning
### Final Report (using Python)
Olyssa Sperling (614143)
olyssa.sperling@student.hu-berlin.de
Module: Systems Biology: Computational Analysis and Interpretation of High-throughput Data

#### Background
##### 1) CpG methylation data:
CpG methylation data refers to information about the methylation status of cytosine-guanine dinucleotides (CpG sites) in DNA. DNA methylation is an epigenetic modification where a methyl group is added to the 5th carbon of the cytosine ring, primarily at CpG sites. Methylation patterns are crucial for normal development and cellular differentiation. CpG methylation data is often presented as a table with rows as CpG sites and columns as samples.

##### 2) CpG methylation influenced by age:
With age, CpG methylation undergoes a combination of global hypomethylation, localized hypermethylation, and epigenetic drift. These patterns are not just markers of aging but also play active roles in age-related diseases. By studying these changes, researchers can better understand the molecular basis of aging and develop interventions to promote healthy aging.

Key aspects are:
* **Global Hypomethylation**: As individuals age, there is a general loss of methylation across the genome, particularly in repetitive elements and intergenic regions. Hypomethylation in these regions can lead to genomic instability, activation of transposable elements, and inappropriate gene expression.

* **Local Hypermethylation**: While global hypomethylation occurs, some specific CpG sites, especially in promoter regions of certain genes, become hypermethylated with age. Hypermethylation often occurs in tumor suppressor genes and pathways involved in cell cycle regulation, contributing to age-related diseases, including cancer.

* **Epigenetic drift**: Epigenetic drift refers to the stochastic (random) changes in DNA methylation patterns that accumulate with age, leading to increased inter-individual variability in methylation profiles over time. This drift is influenced by both genetic and environmental factors, such as diet, smoking, and exposure to pollutants.

* **Age-Associated Differentially Methylated Positions (aDMPs)**: Certain CpG sites show predictable changes in methylation with age. These aDMPs are often enriched in regulatory regions and genes involved in development, such as those in the HOX gene family.

* **Impact on Health**:
    * *Cancer*: Age-related methylation changes can lead to hypermethylation of tumor suppressor genes and hypomethylation of oncogenes.
    * *Neurodegeneration*: Methylation changes in genes involved in neuronal function and inflammation contribute to conditions like Alzheimer's disease.
    * *Immune System Decline*: Age-related methylation patterns in immune-related genes contribute to immunosenescence (decline of immune function).

#### The dataset
The dataset consists of two files. “metRmOlWithDbgapId.txt” is a table of CpG methylation values: each row is a CpG, each column is an individual, and the first row holds the subject/sample/individual IDs. “subjects.txt” has information on the individuals/samples/subjects. The most important attribute here is age; there are other attributes such as sex and race. The IDs (dbGaP IDs) in “subjects.txt” should match the columns of the first table.

#### The Problem
The main question is: what is the relationship between age and methylation? We explore this with the following tasks.

##### A) Theory

0. Read the tables and merge or organize them so that you have an age value for each sample/subject. Your predictive variables are CpG methylation values per sample. For simplicity, ignore the other covariates in “subjects.txt”; we only care about age. Age can be negative if the individual was not yet born.

1. Which strategies/methods can be used to test whether methylation is predictive of age in this data set? List some methods and how they can be used to test this. If possible, give software package names that can be used for the strategies you mention.
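For the table organization in task 0, a minimal sketch might look like the following. The column names `dbGaP ID` and `Age` are assumptions about “subjects.txt” and would need to be adapted to the actual header; the toy tables stand in for the real files.

```python
import pandas as pd

def align_age_with_methylation(meth, subjects, id_col="dbGaP ID", age_col="Age"):
    """Return (X, y): a samples-by-CpGs table and the matching age vector.

    meth: CpGs in rows, samples in columns (as in the methylation table).
    subjects: one row per individual; id_col and age_col are ASSUMED names.
    """
    ages = subjects.set_index(id_col)[age_col]
    ages.index = ages.index.astype(str)          # compare ids as strings
    X = meth.T                                   # transpose: samples become rows
    X.index = X.index.astype(str)
    shared = X.index[X.index.isin(ages.index)]   # keep samples with a known age
    return X.loc[shared], ages.loc[shared]

# Toy stand-ins for the two files (values are made up for illustration).
meth = pd.DataFrame({"101": [0.1, 0.9], "102": [0.8, 0.2], "103": [0.5, 0.4]},
                    index=["cg01", "cg02"])
subjects = pd.DataFrame({"dbGaP ID": [103, 101, 999], "Age": [34.0, -0.3, 60.0]})
X, y = align_age_with_methylation(meth, subjects)
```

With the real data, the two tables would first be loaded, for example with `pd.read_csv(..., sep="\t")` if the files are tab-delimited (also an assumption). Samples without an age, and subjects without methylation values, are dropped by the alignment.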

##### B) Practical

2. Data processing/preparation tasks

2a. If you want to normalize the predictive features and/or filter them by variability, now is the time. Normalization may not be necessary for CpG methylation. The methylation values are bimodal, so normalizing them would ideally require first transforming them to look more Gaussian, which may be too complicated. Such transformation and normalization is optional and may not affect model performance much. [Add on: don’t worry about normalization, but you might have to filter to reduce the number of variables (CpGs), both because of the limited computing power of your workstations and because predictors that are not variable are unlikely to impact your model]
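The filtering step could be sketched as below, keeping only the most variable CpGs. The optional transform assumes the methylation values are beta values in [0, 1], which the task description does not state explicitly; under that assumption, the logit (M-value) transform is one common way to make them look more Gaussian.

```python
import numpy as np

def top_variable_features(X, n_keep):
    """Column indices of the n_keep features with the highest variance."""
    variances = np.var(X, axis=0)
    order = np.argsort(variances)[::-1]      # most variable first
    return np.sort(order[:n_keep])           # return in original column order

def beta_to_m(beta, eps=1e-6):
    """Base-2 logit of values ASSUMED to lie in [0, 1]; eps avoids log(0)."""
    b = np.clip(beta, eps, 1 - eps)
    return np.log2(b / (1 - b))

# Toy samples-by-CpGs matrix: column 0 is constant, column 2 varies the most.
X_toy = np.array([[0.1, 0.5, 0.9],
                  [0.1, 0.6, 0.1],
                  [0.1, 0.4, 0.5]])
keep = top_variable_features(X_toy, n_keep=2)
X_filtered = X_toy[:, keep]
```

To avoid leaking information from the test set, the variance ranking would ideally be computed on the training samples only.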

2b. Randomly select 10% of the samples as a test set. Train with cross-validation on the remaining 90% and evaluate performance on the test set. [Add on: If you can’t figure out how to do this, skip this and rely on CV accuracy only to answer questions]
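One way to set up the hold-out split and the cross-validation folds is sketched here with plain NumPy. The function names, the fixed seed, and the choice of five folds are illustrative assumptions, not part of the task.

```python
import numpy as np

def split_train_test(n_samples, test_fraction=0.1, seed=0):
    """Randomly split sample indices into (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = max(1, int(round(test_fraction * n_samples)))
    return np.sort(shuffled[n_test:]), np.sort(shuffled[:n_test])

def kfold(train_idx, k=5, seed=0):
    """Yield (fit, validation) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(train_idx), k)
    for i in range(k):
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        yield fit, folds[i]

train_idx, test_idx = split_train_test(n_samples=50)
cv_folds = list(kfold(train_idx, k=5))
```

The test indices are set aside once and never used during cross-validation. scikit-learn's `train_test_split` and `KFold` provide the same functionality if that package is available.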


3. Is methylation predictive of age (previous research indicates so)? What are the CV performance and test set performance of your model? Pick two strategies you listed for question 1 and implement them to answer this question.

4. Are all CpGs associated with age? If not, which CpGs are more important for age prediction? Do both methods agree, i.e., do they give similar variable importance? Plot importance values as barplots for both methods. [Hint: You need to do some sort of variable importance or variable selection for both methods]

5. ?

5a. Use a sensible cutoff to define “the most important” using variable importance from both methods. Are you picking the top 10, top 20, or top x percentile? Just define “the top most important”; there are not really any wrong answers here. Combine the importance metrics from both methods into a unified importance metric; use your imagination here.
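As one possible unified metric, the two importance vectors can be converted to ranks and averaged, which sidesteps the fact that the two methods report importance on different scales. This is only a sketch of one option; the input vectors below are made up, and taking absolute values assumes that sign (for example of a regression coefficient) does not matter for importance.

```python
import numpy as np

def rank_scale(scores):
    """Map scores to (0, 1] by rank; the largest absolute score gets 1."""
    a = np.abs(np.asarray(scores, dtype=float))
    ranks = a.argsort().argsort() + 1    # 1 = least important; ties broken arbitrarily
    return ranks / len(a)

def combined_importance(imp_a, imp_b):
    """Average of the rank-scaled importances of two methods."""
    return (rank_scale(imp_a) + rank_scale(imp_b)) / 2

def top_k(importance, k):
    """Indices of the k largest importance values, most important first."""
    return np.argsort(importance)[::-1][:k]

# Hypothetical importances for four CpGs from two different methods.
imp_method_a = [0.9, -0.1, 0.5, 0.0]     # e.g. signed model coefficients
imp_method_b = [200, 1, 150, 10]         # e.g. an impurity-based score
combined = combined_importance(imp_method_a, imp_method_b)
best = top_k(combined, k=2)
```

The cutoff `k` (top 10, top 20, or a percentile) remains a free choice, as the task states.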


5b. Then, make a heatmap using CpG methylation values for the most important CpGs, but only cluster the CpGs. The sample columns should be sorted by age but not clustered. You might be able to see a visual pattern of CpG methylation values that is associated with the age of the samples on the heatmap. Display the heatmap as described above. Can you observe such a pattern? Explain the pattern you see in writing.
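The layout described above (CpGs in rows, samples in columns ordered by age) can be prepared as in the following sketch. It assumes a samples-by-CpGs `DataFrame` and an age `Series` sharing the same sample index; the names and toy values are illustrative. The plotting helper imports seaborn lazily so that the table preparation works without it.

```python
import pandas as pd

def age_sorted_table(X, age, cpgs):
    """CpGs-by-samples table restricted to `cpgs`, columns ordered by age."""
    sample_order = age.sort_values().index   # youngest to oldest
    return X.loc[sample_order, cpgs].T

def plot_age_sorted_heatmap(table):
    """Cluster rows (CpGs) only; keep the age-sorted column order."""
    import seaborn as sns                    # lazy import, only needed for plotting
    return sns.clustermap(table, row_cluster=True, col_cluster=False)

# Toy data: three samples, three CpGs (values are made up).
X_toy = pd.DataFrame({"cgA": [0.2, 0.8, 0.5],
                      "cgB": [0.9, 0.1, 0.4],
                      "cgC": [0.3, 0.7, 0.6]},
                     index=["s1", "s2", "s3"])
age_toy = pd.Series({"s1": 40.0, "s2": -0.5, "s3": 12.0})
table = age_sorted_table(X_toy, age_toy, cpgs=["cgC", "cgA"])
```

Calling `plot_age_sorted_heatmap(table)` would then draw the heatmap with a dendrogram on the CpG rows only.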


6. Pick the top 3 most important variables and build a simple linear model with them to predict age. What is the performance of this model on the test set? Is it better or worse than the initial machine learning models with more variables? Show an accuracy metric on the test set as a bar plot.
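A simple linear model for task 6 can be fitted by ordinary least squares, as in this sketch. The synthetic data stand in for the three selected CpGs and are generated from a known linear rule purely for illustration; RMSE and R² are two possible accuracy metrics, and the task leaves the choice open.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + X @ coef[1:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

# Synthetic stand-in for three top CpGs (noise-free, so the fit is exact).
rng = np.random.default_rng(0)
X_train, X_test = rng.random((30, 3)), rng.random((10, 3))
true_coef = np.array([2.0, 3.0, -1.0, 0.5])
y_train = predict_linear(true_coef, X_train)
y_test = predict_linear(true_coef, X_test)

coef = fit_linear(X_train, y_train)
test_rmse = rmse(y_test, predict_linear(coef, X_test))
test_r2 = r_squared(y_test, predict_linear(coef, X_test))
```

The same metric computed for the larger models on the same test set can then be shown side by side, for example with `matplotlib.pyplot.bar`.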