# Are our significant DMGs the core enrichment genes in our identified enriched pathways?
We want to connect our statistically significant differentially methylated genes (DMGs) to our identified enriched pathways so we can start to make sense of things biologically.

I have generated two csv files for **phase 1 warm vs. control oysters**:
- phase1_wc_genes.csv - list of significant (adjusted p-value < 0.05) DMGs
- p1_wc_pathway.csv - list of enriched pathways from KEGG

Each enriched pathway comes with a list of 'core enrichment genes': the genes reported as part of the 'core enrichment' that contribute to the observed enrichment score.

The thinking is that some of our significant DMGs may be part of that core enrichment group, which could tell us that the pathway is especially important/biologically relevant.
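
As a minimal sketch of that idea, using made-up gene IDs rather than values from our files: each pathway's core enrichment genes arrive as one `/`-separated string, so the check amounts to splitting that string and testing membership in the DMG list. The names `core_string` and `sig_dmgs` are hypothetical.

```r
# Toy example with made-up IDs (not real data from the csv files)
core_string <- "1001/1002/1003/1004"      # hypothetical core enrichment string for one pathway
sig_dmgs    <- c("1002", "1004", "2001")  # hypothetical significant DMG IDs

# split the '/'-separated string into individual gene IDs
core_genes <- strsplit(core_string, "/", fixed = TRUE)[[1]]

# which core enrichment genes are also significant DMGs?
overlap <- core_genes[core_genes %in% sig_dmgs]
overlap
length(overlap)
```

The length of `overlap` is the per-pathway match count that section III computes for every pathway.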

#### I. Load packages

In [2]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


#### II. Load, clean, and prep both csv files

In [3]:
# load in csv file
pathway <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_kegg_pathways/p1_wc_pathway.csv')

# drop the first column (row index written out with the csv)
pathway <- pathway[,-1]

# checking dimensions
dim(pathway) # 119 pathways (rows), 11 columns of info/metadata

head(pathway)

Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>
1,cvn00910,Nitrogen metabolism,10,0.7751841,1.619242,0.007876028,0.2056087,0.1873304,1398,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592
2,cvn00053,Ascorbate and aldarate metabolism,11,0.7413812,1.583931,0.013924094,0.2056087,0.1873304,2374,"tags=55%, list=18%, signal=45%",111124535/111103451/111124599/111112920/111115614/111103498
3,cvn00052,Galactose metabolism,21,0.6484812,1.582238,0.015374308,0.2056087,0.1873304,2261,"tags=38%, list=17%, signal=32%",111101197/111118471/111101820/111113388/111109442/111099882/111120703/111118006
4,cvn00592,alpha-Linolenic acid metabolism,11,0.7377666,1.576208,0.015550238,0.2056087,0.1873304,1602,"tags=45%, list=12%, signal=40%",111113990/111115744/111107112/111115745/111124908
5,cvn00511,Other glycan degradation,37,0.5917499,1.563906,0.004948574,0.2056087,0.1873304,2221,"tags=30%, list=16%, signal=25%",111106921/111106925/111106928/111119851/111119435/111120040/111113388/111119434/111106926/111119431/111106930
6,cvn03250,Viral life cycle,28,0.6023539,1.532503,0.00806462,0.2056087,0.1873304,2628,"tags=46%, list=19%, signal=37%",111124701/111124696/111129825/111111579/111108190/111135084/111128997/111124977/111106750/111123417/111130886/111104027/111135329


In [4]:
# load data frame
genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_sig_genes/phase1_wc_genes.csv')

# KEGG uses entrez IDs, which are my gene IDs without the 'LOC' in front of them, so need to strip that prefix
genes$X <- substr(genes$X, start = 4, stop = nchar(genes$X))

# only grabbing the columns I care about
genes <- select(genes, X, log2FoldChange, padj)

# renaming columns to make more sense
colnames(genes) <- c('gene', 'lfc', 'padj')

# checking dimensions
dim(genes) # 189 sig DMGs (rows), 3 columns

head(genes)

Unnamed: 0_level_0,gene,lfc,padj
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>
1,111128103,1.644558,0.0307415459
2,111137770,2.655903,0.0178939261
3,111111295,2.36024,0.0137418025
4,111125391,2.205531,0.0207473298
5,111110197,1.863842,0.0209235911
6,111115675,2.781707,7.92069e-05


#### III. Are the significant genes in the core enrichment of the pathways?
Taking our list of significant genes and going row by row through our pathways to see whether any sig. gene matches any of the core enrichment genes

In [5]:
# generated from ChatGPT

# Working copies of the two data frames
df1 <- genes
df2 <- pathway

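# Function to count how many genes in one core enrichment string (a row of df2) are in df1
get_gene_matches <- function(row_df2, df1_genes) {
  # split the '/'-separated string into individual gene IDs
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  # count the IDs that also appear in the significant DMG list
  count <- sum(genes2 %in% df1_genes)
  return(count)
}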

# Apply the function to the core enrichment string of each row of df2
matches_count <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene)

# Add the matches count to df2
df2$MatchesCount <- matches_count

# Sort df with highest match counts at the top
gene_pathway_match <- df2[order(-df2$MatchesCount),]
head(gene_pathway_match)


Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesCount
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<int>
9,cvn04814,Motor proteins,110,0.4860525,1.448496,0.008328566,0.2056087,0.1873304,1179,"tags=18%, list=9%, signal=17%",111136151/111134768/111107338/111102596/111112439/111103394/111115784/111129526/111107250/111127380/111134843/111137068/111131563/111119946/111134888/111120500/111129376/111130940/111125250/111131555,7
15,cvn04144,Endocytosis,130,0.4435434,1.329267,0.03058104,0.2426096,0.2210419,3185,"tags=32%, list=24%, signal=25%",111120187/111125099/111112319/111119513/111112439/111119512/111136896/111104852/111102907/111107174/111136866/111125956/111134954/111112863/111116971/111112700/111135084/111133388/111134171/111121253/111115795/111127289/111129312/111134242/111121437/111129503/111135594/111106223/111123210/111125223/111105462/111104835/111104585/111104028/111133563/111105923/111119177/111101822/111104196/111137900/111112119/111123772,5
41,cvn03040,Spliceosome,91,0.4067843,1.195986,0.179752066,0.5397379,0.4917559,1557,"tags=16%, list=11%, signal=15%",111112733/111137770/111119513/111119512/111129112/111133954/111121854/111121021/111118318/111119442/111134531/111114893/111136440/111135640/111136164,4
1,cvn00910,Nitrogen metabolism,10,0.7751841,1.619242,0.007876028,0.2056087,0.1873304,1398,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592,3
30,cvn04070,Phosphatidylinositol signaling system,55,0.4596708,1.2838,0.115015974,0.5397379,0.4917559,2592,"tags=27%, list=19%, signal=22%",111101050/111127799/111125442/111100277/111125100/111126338/111100148/111135914/111134544/111120505/111135557/111122823/111128823/111138290/111130239,3
34,cvn00562,Inositol phosphate metabolism,46,0.454872,1.242543,0.15536105,0.5397379,0.4917559,2592,"tags=26%, list=19%, signal=21%",111101050/111127799/111125442/111125100/111126338/111100148/111135914/111134544/111120505/111135557/111138290/111130239,3
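
The counts above say how many significant DMGs fall in each core enrichment, but not which ones. A possible extension, sketched here on toy data, records the matching IDs alongside the count. The names `toy_pathway`, `toy_genes`, and `MatchedGenes` are hypothetical and not part of the analysis above.

```r
# Toy stand-ins for the pathway and gene tables (made-up IDs)
toy_pathway <- data.frame(
  Description     = c("pathway A", "pathway B", "pathway C"),
  core_enrichment = c("1001/1002/1003", "1004/1005", "1002/1006"),
  stringsAsFactors = FALSE
)
toy_genes <- c("1002", "1003", "1006")

# split each core enrichment string into a character vector of IDs
core_list <- strsplit(toy_pathway$core_enrichment, "/", fixed = TRUE)

# keep only the IDs that are also in the DMG list
matched <- lapply(core_list, function(ids) ids[ids %in% toy_genes])

# number of matches per pathway, plus the matching IDs as one '/'-separated string
toy_pathway$MatchesCount <- lengths(matched)
toy_pathway$MatchedGenes <- vapply(matched, paste, character(1), collapse = "/")

# highest match counts at the top
toy_pathway[order(-toy_pathway$MatchesCount), ]
```

Keeping the matched IDs makes it straightforward to look up the `lfc` and `padj` of the specific DMGs driving each pathway.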


Now that I have a df with the number of matches between core enrichment genes and significant DMGs, I only want to look at pathways with at least one match (filtering out any pathways that did not contain sig. DMGs in their core enrichment).

In [8]:
# only want to look at pathways that have significant genes in their core enrichment
matched_pathways <- filter(gene_pathway_match, MatchesCount != 0)

# checking dimensions to see how many pathways we have 
dim(matched_pathways) # 41 pathways with at least one match

# looking at df
head(matched_pathways)

Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesCount
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<int>
1,cvn04814,Motor proteins,110,0.4860525,1.448496,0.008328566,0.2056087,0.1873304,1179,"tags=18%, list=9%, signal=17%",111136151/111134768/111107338/111102596/111112439/111103394/111115784/111129526/111107250/111127380/111134843/111137068/111131563/111119946/111134888/111120500/111129376/111130940/111125250/111131555,7
2,cvn04144,Endocytosis,130,0.4435434,1.329267,0.03058104,0.2426096,0.2210419,3185,"tags=32%, list=24%, signal=25%",111120187/111125099/111112319/111119513/111112439/111119512/111136896/111104852/111102907/111107174/111136866/111125956/111134954/111112863/111116971/111112700/111135084/111133388/111134171/111121253/111115795/111127289/111129312/111134242/111121437/111129503/111135594/111106223/111123210/111125223/111105462/111104835/111104585/111104028/111133563/111105923/111119177/111101822/111104196/111137900/111112119/111123772,5
3,cvn03040,Spliceosome,91,0.4067843,1.195986,0.179752066,0.5397379,0.4917559,1557,"tags=16%, list=11%, signal=15%",111112733/111137770/111119513/111119512/111129112/111133954/111121854/111121021/111118318/111119442/111134531/111114893/111136440/111135640/111136164,4
4,cvn00910,Nitrogen metabolism,10,0.7751841,1.619242,0.007876028,0.2056087,0.1873304,1398,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592,3
5,cvn04070,Phosphatidylinositol signaling system,55,0.4596708,1.2838,0.115015974,0.5397379,0.4917559,2592,"tags=27%, list=19%, signal=22%",111101050/111127799/111125442/111100277/111125100/111126338/111100148/111135914/111134544/111120505/111135557/111122823/111128823/111138290/111130239,3
6,cvn00562,Inositol phosphate metabolism,46,0.454872,1.242543,0.15536105,0.5397379,0.4917559,2592,"tags=26%, list=19%, signal=21%",111101050/111127799/111125442/111125100/111126338/111100148/111135914/111134544/111120505/111135557/111138290/111130239,3


In [7]:
mean(matched_pathways$MatchesCount)
median(matched_pathways$MatchesCount)
sd(matched_pathways$MatchesCount)

**Stats on Matched Counts**

25 pathways with only 1 match

78 pathways with 0 matches (119 pathways total, minus the 41 with at least one match)

Among the 41 pathways with at least one match:
- mean number of matches: 1.8
- median number of matches: 1
- standard deviation of matches: 1.31

**Motor proteins** is the pathway with the highest number of significant DMGs in its core enrichment, with 7 genes
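
The tallies above can be reproduced with `table()` rather than counted by hand. This is a sketch on a made-up vector of counts (`toy_counts` is hypothetical); in the notebook the input would be `gene_pathway_match$MatchesCount`.

```r
# hypothetical match counts for eight pathways
toy_counts <- c(7, 5, 3, 1, 1, 1, 0, 0)

# how many pathways have each number of matches
tally <- table(toy_counts)
tally

# pathways with zero matches and with exactly one match
n_zero <- sum(toy_counts == 0)
n_one  <- sum(toy_counts == 1)

# summary stats restricted to pathways with at least one match
matched_only <- toy_counts[toy_counts != 0]
mean(matched_only)
median(matched_only)
sd(matched_only)
```

Computing the zero-match tally from the full table (before filtering) avoids mixing up the number of pathways with the number of genes.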