# Are our significant DMGs among the core enrichment genes of our enriched pathways?
We want to connect our statistically significant differentially methylated genes (DMGs) to our enriched pathways so we can start to make sense of the results biologically.

I have generated two csv files for **phase 1 warm vs. control oysters**:
- phase1_wc_genes.csv - list of significant (adjusted p-value < 0.05) DMGs
- p1_wc_pathway.csv - list of enriched pathways from KEGG

Each enriched pathway contains a list of 'core enrichment genes': the genes reported as part of the 'core enrichment' (the leading-edge subset in GSEA terms), which contribute to the observed enrichment score.

The thinking is that some of our significant DMGs may be part of that core enrichment group, which could tell us that the pathway is especially important/biologically relevant.

#### I. Load packages

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


#### II. Load, clean, and prep both csv files

In [6]:
# load in csv file
pathway <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_kegg_pathways/unfiltered_pathways_p1wc.csv')

# clean headers and columns
pathway <- pathway[,-1]

# checking dimensions
dim(pathway) # 119 pathways, 11 columns of info/metadata

head(pathway)

Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>
1,cvn00053,Ascorbate and aldarate metabolism,12,0.7513575,1.659577,0.009928054,0.2092344,0.1874827,1456,"tags=50%, list=11%, signal=45%",111124535/111103451/111124599/111127562/111112920/111115614
2,cvn00910,Nitrogen metabolism,10,0.7752533,1.614827,0.008414069,0.2092344,0.1874827,1396,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592
3,cvn00511,Other glycan degradation,37,0.5915432,1.568428,0.007095078,0.2092344,0.1874827,2224,"tags=30%, list=16%, signal=25%",111106921/111106925/111106928/111119851/111119435/111120040/111113388/111119434/111106926/111119431/111106930
4,cvn00052,Galactose metabolism,21,0.64846,1.564709,0.011543831,0.2092344,0.1874827,2261,"tags=38%, list=17%, signal=32%",111101197/111118471/111101820/111113388/111109442/111099882/111120703/111118006
5,cvn03250,Viral life cycle - HIV-1,28,0.6040501,1.53519,0.012104468,0.2092344,0.1874827,2630,"tags=46%, list=19%, signal=37%",111124701/111124696/111129825/111111579/111108190/111135084/111128997/111124977/111106750/111123417/111130886/111104027/111135329
6,cvn03015,mRNA surveillance pathway,61,0.5365003,1.533996,0.004750431,0.2092344,0.1874827,1963,"tags=36%, list=14%, signal=31%",111135039/111127981/111101410/111100273/111134286/111121238/111129219/111132883/111122880/111118318/111103910/111119442/111126090/111138352/111108043/111130886/111118849/111136672/111104361/111129063/111135694/111129838


I want to know whether every gene in the geneList I provided ends up in a core enrichment, or whether the core enrichment only includes a subset of my total genes

- total genes in my geneList for KEGG - around 13,000 genes
- total enriched pathways - 121
- total number of significant DMGs - 189
- total number of unique genes in the core_enrichment column - **805** (from code below)

since this number (805) is far smaller than the ~13,000 genes in the data set, I know that not all genes are represented in the core_enrichment column. And since 805 is larger than 189, the core enrichment genes cannot all be significant DMGs

In [18]:
# Split the '/'-separated strings in the core_enrichment column into lists
gene_lists <- strsplit(pathway$core_enrichment, '/')

# Flatten the lists
all_genes <- unlist(gene_lists)

# Count the unique genes
unique_genes_count <- length(unique(all_genes))

print(paste("Number of unique genes in the entire DataFrame:", unique_genes_count))


[1] "Number of unique genes in the entire DataFrame: 805"
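
The unique count can be taken one step further by checking the overlap between the significant DMGs and the pooled core enrichment genes directly. This is only a minimal sketch on made-up values: `toy_core` and `toy_sig` are hypothetical stand-ins for `pathway$core_enrichment` and `genes$gene`, not real results.

```r
# toy stand-ins (hypothetical values, not real results)
toy_core <- c("111/222/333", "222/444", "555")   # like pathway$core_enrichment
toy_sig  <- c("222", "555", "999")               # like genes$gene

# pool every core enrichment gene across pathways, keeping each gene once
core_genes <- unique(unlist(strsplit(toy_core, "/", fixed = TRUE)))

# significant DMGs that appear in at least one core enrichment
sig_in_core <- intersect(toy_sig, core_genes)

# significant DMGs that never appear in any core enrichment
sig_not_in_core <- setdiff(toy_sig, core_genes)

length(core_genes)   # number of unique core enrichment genes
sig_in_core
sig_not_in_core
```

On the real data, the same three lines would show how many of the 189 significant DMGs fall among the 805 core enrichment genes.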


In [8]:
# load data frame
genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/unfiltered_lfc_p1wc.csv')

# KEGG uses Entrez gene IDs, which are my 'LOC' gene IDs without the 'LOC' in front of them, so need to strip that prefix
genes$X <- substr(genes$X, start = 4, stop = nchar(genes$X))

# only grabbing the columns I care about
genes <- select(genes, X, log2FoldChange, padj)

# renaming columns to make more sense
colnames(genes) <- c('gene', 'lfc', 'padj')

# only selecting significant genes (rows with NA for padj are dropped too)
genes <- filter(genes, padj < 0.05)

# checking dimensions
dim(genes) # 189 sig DMGs

head(genes)

Unnamed: 0_level_0,gene,lfc,padj
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>
1,111128103,1.644796,0.03065827
2,111137770,2.657125,0.01784408
3,111111295,2.361073,0.01369104
4,111125391,2.206213,0.02068009
5,111110197,1.864207,0.02086317
6,111115675,2.78232,7.896812e-05
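
`substr()` drops the first three characters of every ID whether or not they are 'LOC'. A slightly more defensive sketch, assuming some IDs might not carry the prefix (the `toy_ids` vector is hypothetical), uses `sub()` with an anchored pattern so that only a leading 'LOC' is removed:

```r
toy_ids <- c("LOC111128103", "LOC111137770", "111111295")  # hypothetical IDs

# substr() removes the first three characters unconditionally
by_position <- substr(toy_ids, start = 4, stop = nchar(toy_ids))

# sub() removes "LOC" only when it sits at the start of the string
by_pattern <- sub("^LOC", "", toy_ids)

by_position
by_pattern
```

If every ID in the file starts with 'LOC', both approaches give the same result. They only differ for IDs without the prefix, like the third toy ID here.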


#### III. Are the significant genes in the core enrichment of the pathways?
Taking our list of significant genes and going row by row through our pathways to see whether any significant gene matches a core enrichment gene

In [9]:
# generated from ChatGPT

# working copies of the two data frames
df1 <- genes
df2 <- pathway

# Function to count how many core enrichment genes in one row of df2 are also in df1
get_gene_matches <- function(row_df2, df1_genes) {
  # split the '/'-separated core enrichment string into individual gene IDs
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  return(count)
}

# Apply the function to each row's core_enrichment string
# (USE.NAMES = FALSE stops sapply from naming the result with the long gene strings)
matches_count <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene, USE.NAMES = FALSE)

# Add the matches count to df2
df2$MatchesCount <- matches_count

# Sort df with highest match counts at the top
gene_pathway_match <- df2[order(-df2$MatchesCount),]
head(gene_pathway_match)


Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesCount
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<int>
7,cvn04814,Motor proteins,111,0.4819493,1.450226,0.005775221,0.2092344,0.1874827,1180,"tags=18%, list=9%, signal=17%",111136151/111134768/111107338/111102596/111112439/111103394/111115784/111129526/111107250/111127380/111134843/111137068/111131563/111119946/111134888/111120500/111129376/111130940/111125250/111131555,7
10,cvn04144,Endocytosis,132,0.4453853,1.356744,0.018375737,0.2223464,0.1992317,3187,"tags=33%, list=24%, signal=25%",111120187/111125099/111112319/111119513/111112439/111119512/111136896/111104852/111102907/111107174/111136866/111125956/111134954/111112863/111116971/111112700/111135084/111133388/111134171/111121253/111115795/111127289/111129312/111134242/111121335/111121437/111129503/111135594/111106223/111123210/111125223/111105462/111104835/111104585/111104028/111133563/111105923/111119177/111101822/111104196/111137900/111112119/111123772,5
37,cvn03040,Spliceosome,92,0.3987387,1.182876,0.171954964,0.5794411,0.5192034,1558,"tags=16%, list=12%, signal=15%",111112733/111137770/111119513/111119512/111129112/111133954/111121854/111121021/111118318/111119442/111134531/111114893/111136440/111135640/111136164,4
2,cvn00910,Nitrogen metabolism,10,0.7752533,1.614827,0.008414069,0.2092344,0.1874827,1396,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592,3
25,cvn00562,Inositol phosphate metabolism,47,0.4705413,1.303501,0.100540541,0.4679002,0.4192581,2592,"tags=28%, list=19%, signal=22%",111101050/111127799/111125442/111125100/111127562/111126338/111100148/111135914/111134544/111120505/111135557/111138290/111130239,3
26,cvn04070,Phosphatidylinositol signaling system,55,0.4596492,1.297387,0.098606645,0.4679002,0.4192581,2592,"tags=27%, list=19%, signal=22%",111101050/111127799/111125442/111100277/111125100/111126338/111100148/111135914/111134544/111120505/111135557/111122823/111128823/111138290/111130239,3
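
The counts say how many significant DMGs sit in each core enrichment, but not which ones. A minimal sketch of keeping the matched gene IDs alongside the count, on hypothetical toy data (`toy_pathway`, `toy_sig` and the `MatchedGenes` column are illustrative names, not part of the analysis above):

```r
# toy stand-ins (hypothetical values, not real results)
toy_pathway <- data.frame(
  Description     = c("Pathway A", "Pathway B", "Pathway C"),
  core_enrichment = c("111/222/333", "222/444", "555/666"),
  stringsAsFactors = FALSE
)
toy_sig <- c("222", "333", "999")

# split each pathway's core enrichment string into a vector of gene IDs
core_list <- strsplit(toy_pathway$core_enrichment, "/", fixed = TRUE)

# keep the IDs of the significant DMGs found in each pathway
matched <- lapply(core_list, intersect, y = toy_sig)

toy_pathway$MatchesCount <- lengths(matched)
toy_pathway$MatchedGenes <- vapply(matched, paste, character(1), collapse = "/")

# highest match counts at the top
toy_pathway[order(-toy_pathway$MatchesCount), ]
```

`intersect()` returns each gene once, so this gives the same counts as `sum(genes2 %in% df1_genes)` as long as no gene ID is repeated within a single core enrichment string.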


now that I have a df with the number of matches between core enrichment genes and significant DMGs for each pathway, I only want to look at pathways with at least one match (filtering out any pathways with no sig. DMGs in their core enrichment)

In [10]:
# only want to look at pathways that have significant genes in their core enrichment
matched_pathways <- filter(gene_pathway_match, MatchesCount != 0)

# checking dimensions to see how many pathways we have 
dim(matched_pathways) # 41 pathways with at least one match

# looking at df
head(matched_pathways)

Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesCount
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<int>
1,cvn04814,Motor proteins,111,0.4819493,1.450226,0.005775221,0.2092344,0.1874827,1180,"tags=18%, list=9%, signal=17%",111136151/111134768/111107338/111102596/111112439/111103394/111115784/111129526/111107250/111127380/111134843/111137068/111131563/111119946/111134888/111120500/111129376/111130940/111125250/111131555,7
2,cvn04144,Endocytosis,132,0.4453853,1.356744,0.018375737,0.2223464,0.1992317,3187,"tags=33%, list=24%, signal=25%",111120187/111125099/111112319/111119513/111112439/111119512/111136896/111104852/111102907/111107174/111136866/111125956/111134954/111112863/111116971/111112700/111135084/111133388/111134171/111121253/111115795/111127289/111129312/111134242/111121335/111121437/111129503/111135594/111106223/111123210/111125223/111105462/111104835/111104585/111104028/111133563/111105923/111119177/111101822/111104196/111137900/111112119/111123772,5
3,cvn03040,Spliceosome,92,0.3987387,1.182876,0.171954964,0.5794411,0.5192034,1558,"tags=16%, list=12%, signal=15%",111112733/111137770/111119513/111119512/111129112/111133954/111121854/111121021/111118318/111119442/111134531/111114893/111136440/111135640/111136164,4
4,cvn00910,Nitrogen metabolism,10,0.7752533,1.614827,0.008414069,0.2092344,0.1874827,1396,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592,3
5,cvn00562,Inositol phosphate metabolism,47,0.4705413,1.303501,0.100540541,0.4679002,0.4192581,2592,"tags=28%, list=19%, signal=22%",111101050/111127799/111125442/111125100/111127562/111126338/111100148/111135914/111134544/111120505/111135557/111138290/111130239,3
6,cvn04070,Phosphatidylinositol signaling system,55,0.4596492,1.297387,0.098606645,0.4679002,0.4192581,2592,"tags=27%, list=19%, signal=22%",111101050/111127799/111125442/111100277/111125100/111126338/111100148/111135914/111134544/111120505/111135557/111122823/111128823/111138290/111130239,3


In [11]:
write.csv(matched_pathways, '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_kegg_pathways/p1wc_siggene_pathways.csv')

In [None]:
mean(matched_pathways$MatchesCount)
median(matched_pathways$MatchesCount)
sd(matched_pathways$MatchesCount)

**Stats on Matched Counts**

25 pathways with only 1 match

148 pathways with 0 matches

- mean number of matches: 1.8
- median number of matches: 1
- standard deviation of matches: 1.31

the **Motor proteins** pathway has the highest number of significant DMGs in its core enrichment, with 7 genes
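
The tallies above can be computed rather than counted by hand. A minimal sketch, assuming a vector of match counts like `gene_pathway_match$MatchesCount`; the `toy_counts` values here are made up for illustration:

```r
# hypothetical match counts for eight pathways (not real results)
toy_counts <- c(3, 1, 0, 1, 0, 2, 1, 0)

# how many pathways have 0, 1, 2, ... matches
table(toy_counts)

# the same questions asked directly
n_zero <- sum(toy_counts == 0)   # pathways with no significant DMGs
n_one  <- sum(toy_counts == 1)   # pathways with exactly one significant DMG

# summary stats restricted to pathways with at least one match
with_match <- toy_counts[toy_counts > 0]
c(mean = mean(with_match), median = median(with_match), sd = sd(with_match))
```

Running `table()` on the unfiltered `gene_pathway_match$MatchesCount` would give the 0-match and 1-match tallies in one step, which makes them easy to check against the number of rows in `pathway`.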

## Investigating adjusted p-values with NA values
We're trying to figure out *why* some genes get NA for their adjusted p-value so we can decide whether to exclude or keep those genes in our pathway analysis

Starting with our list of genes and their stats from DESeq2 (normal lfcShrink and lfcThreshold=0.5)

In [None]:
unfilter_genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_sig_genes/unfiltered_phase1_wc_genes.csv')
head(unfilter_genes)

filtering df to only include genes with NA for adjusted p-value

In [None]:
genes_w_na <- unfilter_genes[is.na(unfilter_genes$padj),]
dim(genes_w_na) # 5,776 genes with NA for adjusted p-value

looking more into the stats of the genes with NA for adjusted p-value, specifically the baseMean (since this is what the DESeq2 documentation points to)
> baseMean - the average of the normalized count values, dividing by size factors, taken over all samples

In [None]:
mean(genes_w_na$baseMean) # average = 4.5 counts per sample
sd(genes_w_na$baseMean) # standard deviation = 22
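
The DESeq2 documentation describes three situations that produce NA: all counts are zero (baseMean is 0, and the p-value and adjusted p-value are NA), a sample is an extreme count outlier (p-value and adjusted p-value are both NA), or the gene is removed by independent filtering for a low mean count (only the adjusted p-value is NA). A minimal sketch of labelling genes by those rules, assuming the results table also has a `pvalue` column. The `toy_res` data frame and the `na_reason` column are hypothetical:

```r
# hypothetical rows mimicking DESeq2 results columns (values are made up)
toy_res <- data.frame(
  gene     = c("g1", "g2", "g3", "g4"),
  baseMean = c(0, 3.2, 250.0, 80.5),
  pvalue   = c(NA, 0.40, NA, 0.001),
  padj     = c(NA, NA, NA, 0.01)
)

# label each gene by the rule that would explain its NA padj
toy_res$na_reason <- with(toy_res, ifelse(
  !is.na(padj), "not NA",
  ifelse(baseMean == 0, "all counts zero",
  ifelse(is.na(pvalue), "count outlier",
         "low mean count (independent filtering)"))))

table(toy_res$na_reason)
toy_res
```

Applied to `unfilter_genes`, a table like this would show how the 5,776 NA genes split between low counts and outliers.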

loading in counts matrix that was generated with featureCounts to pull out the genes with NA for adjusted p-value

In [None]:
# loading in counts matrix
counts_matrix <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/counts_and_meta/featureCounts_meta.csv')
head(counts_matrix)

In [None]:
# setting genes as row names
rownames(counts_matrix) <- counts_matrix$X

# removing gene column (since they are now the row names)
counts_matrix2 <- counts_matrix[,-1]
head(counts_matrix2)

In [None]:
# replace the '.' with '-'
cleaned_column_names2 <- gsub('\\.', "-", colnames(counts_matrix2))
head(cleaned_column_names2)

# now assigning to the columns
colnames(counts_matrix2) <- cleaned_column_names2
head(counts_matrix2)
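
The dots appear because `read.csv()` converts header names into syntactically valid R names by default. A minimal sketch of avoiding the rename at load time with the `check.names = FALSE` argument, using a small in-memory CSV with hypothetical sample names and counts:

```r
# a tiny CSV in memory (hypothetical sample names and counts)
csv_text <- "gene,W-01,G-02\nLOC111,5,0\nLOC222,12,7"

# default: hyphens in the header are converted to dots
default_names <- colnames(read.csv(text = csv_text, row.names = 1))

# check.names = FALSE keeps the header exactly as written in the file
kept_names <- colnames(read.csv(text = csv_text, row.names = 1, check.names = FALSE))

default_names
kept_names
```

Unlike `gsub('\\.', "-", ...)`, this cannot accidentally turn a dot that was really part of a sample name into a hyphen. `row.names = 1` also sets the gene column as row names in the same call.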

In [None]:
# creating new df of counts matrix of genes with NA for padj
counts_na <- counts_matrix2[rownames(counts_matrix2) %in% genes_w_na$X,]
dim(counts_na) # 5,776 genes 
head(counts_na)

also need to load in the metadata so that I can pull out the right columns, i.e. the samples that were warm or control for phase 1

In [None]:
meta <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/counts_and_meta/CV_CE18_meta.csv')
head(meta)

pulling out only the samples that were either warm or control for phase 1 treatment (we're ignoring the effects of phase 2 for this analysis)

In [None]:
p1_wc_meta <- filter(meta, Phase1 %in% c('warm', 'control'))
dim(p1_wc_meta) # 15 total samples
head(p1_wc_meta)

In [None]:
p1_wc_meta

In [None]:
# only looking at samples with control or warm for phase 1 in the counts matrix
filtered_df <- counts_na[,colnames(counts_na) %in% p1_wc_meta$unique_ID]
dim(filtered_df) # still seeing 5,776 genes with NA for padj and info for 15 samples
head(filtered_df,20)

all of the **W##** samples are **control** and all of the **G##** samples are **warm** for phase 1

exporting this df so I can open it in Excel, where it's easier to see why we might get NA for those genes

In [None]:
write.csv(filtered_df, '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/featureCounts_analysis/na_p1_wc_countsmatrix.csv')
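
As an alternative to scanning the counts by eye, a few per-gene summaries can point at the likely reason for an NA (mostly zeros, or one extreme sample). This is only a sketch on a hypothetical toy matrix: `toy_counts` and the `na_summary` column names are illustrative, and these are raw counts rather than the normalized counts DESeq2 uses.

```r
# hypothetical counts for three genes across four samples
toy_counts <- matrix(
  c(0, 0, 1, 0,
    4, 6, 0, 0,
    2, 3, 250, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("W-01", "W-02", "G-01", "G-02"))
)

# W## = control, G## = warm (phase 1)
is_control <- startsWith(colnames(toy_counts), "W")
is_warm    <- startsWith(colnames(toy_counts), "G")

na_summary <- data.frame(
  n_zero_samples = rowSums(toy_counts == 0),                       # samples with no reads
  mean_control   = rowMeans(toy_counts[, is_control, drop = FALSE]),
  mean_warm      = rowMeans(toy_counts[, is_warm, drop = FALSE]),
  max_count      = apply(toy_counts, 1, max)                       # highlights a single extreme sample
)
na_summary
```

In the toy output, geneA looks like a low-count gene, while geneC has one sample far above the rest, which is the pattern expected for a count outlier.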

# Unfiltered DESeq DF and Pathway
still looking at significantly differentially methylated genes for phase 1 warm vs. control, but now genes are assigned NA for the adjusted p-value only when they have low counts. Outliers are now kept in instead of being assigned NA, so a different set of genes goes into the enriched pathway analysis

In [None]:
# load in csv file
pathway <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_kegg_pathways/unfiltered_pathways_p1wc.csv')

# clean headers and columns
pathway <- pathway[,-1]

# checking dimensions
dim(pathway) # 119 pathways, 11 columns of info/metadata

head(pathway)

In [None]:
# load data frame
genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/unfiltered_lfc_p1wc.csv')

# KEGG uses Entrez gene IDs, which are my 'LOC' gene IDs without the 'LOC' in front of them, so need to strip that prefix
genes$X <- substr(genes$X, start = 4, stop = nchar(genes$X))

# only grabbing the columns I care about
genes <- select(genes, X, log2FoldChange, padj)

# renaming columns to make more sense
colnames(genes) <- c('gene', 'lfc', 'padj')

# only selecting genes that have padj < 0.05 (rows with NA for padj are dropped too)
genes <- filter(genes, padj < 0.05)

# checking dimensions
dim(genes) # 189 sig DMGs

head(genes)

#### III. Are the significant genes in the core enrichment of the pathways?
Taking our list of significant genes and going row by row through our pathways to see whether any significant gene matches a core enrichment gene

In [None]:
# generated from ChatGPT

# working copies of the two data frames
df1 <- genes
df2 <- pathway

# Function to count how many core enrichment genes in one row of df2 are also in df1
get_gene_matches <- function(row_df2, df1_genes) {
  # split the '/'-separated core enrichment string into individual gene IDs
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  return(count)
}

# Apply the function to each row's core_enrichment string
# (USE.NAMES = FALSE stops sapply from naming the result with the long gene strings)
matches_count <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene, USE.NAMES = FALSE)

# Add the matches count to df2
df2$MatchesCount <- matches_count

# Sort df with highest match counts at the top
gene_pathway_match <- df2[order(-df2$MatchesCount),]
head(gene_pathway_match)


now that I have a df with the number of matches between core enrichment genes and significant DMGs for each pathway, I only want to look at pathways with at least one match (filtering out any pathways with no sig. DMGs in their core enrichment)

In [None]:
# only want to look at pathways that have significant genes in their core enrichment
matched_pathways <- filter(gene_pathway_match, MatchesCount != 0)

# checking dimensions to see how many pathways we have 
dim(matched_pathways) # 41 pathways with at least one match

# looking at df
head(matched_pathways)

In [None]:
mean(matched_pathways$MatchesCount)
median(matched_pathways$MatchesCount)
sd(matched_pathways$MatchesCount)

**Stats on Matched Counts**

25 pathways with only 1 match

148 pathways with 0 matches

- mean number of matches: 1.8
- median number of matches: 1
- standard deviation of matches: 1.31

the **Motor proteins** pathway has the highest number of significant DMGs in its core enrichment, with 7 genes

**these results are the *same* as without filtering the outlier counts**
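
A claim like this can be checked in code rather than by comparing printed tables. A minimal sketch, assuming the matched pathways from each run were kept as separate data frames (`run_filtered` and `run_unfiltered` are hypothetical names with made-up values):

```r
# hypothetical results from the two runs (values are made up)
run_filtered   <- data.frame(ID = c("p1", "p2", "p3"), MatchesCount = c(7, 5, 4))
run_unfiltered <- data.frame(ID = c("p3", "p1", "p2"), MatchesCount = c(4, 7, 5))

# put both in the same row order before comparing
a <- run_filtered[order(run_filtered$ID), ]
b <- run_unfiltered[order(run_unfiltered$ID), ]
rownames(a) <- NULL
rownames(b) <- NULL

# TRUE only if pathway IDs and match counts agree in every row
same_results <- isTRUE(all.equal(a, b))
same_results
```

Sorting and resetting the row names first matters because `all.equal()` would otherwise report a difference for two tables that hold the same rows in a different order.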
