# Are our significant DMGs the core enrichment genes in our identified enriched pathways?
We want to connect our statistically significant differentially methylated genes (DMGs) to our identified enriched pathways so we can start to make sense of things biologically.

I have generated two csv files for **phase 1 warm vs. control oysters**:
- phase1_wc_genes.csv - list of significant (adjusted p-value < 0.05) DMGs
- pathways_p1wc.csv - list of enriched pathways from KEGG

Each enriched pathway contains a list of 'core enrichment genes': the genes that are reported as part of the 'core enrichment' and contribute to the observed enrichment score.

The thinking is that some of our significant DMGs may be part of that core enrichment group, which could tell us that the pathway is especially important/biologically relevant.
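As a minimal sketch of the idea (toy values only; `sig_dmgs` and `core_genes` are made-up names for illustration), the overlap between a list of significant DMGs and one pathway's core enrichment can be found with `intersect()`:

```r
# Toy data for illustration only (one ID is invented)
sig_dmgs <- c("111113990", "111115744", "111100000")
core_enrichment <- "111113990/111115744/111107112"

# Split the '/'-separated string into individual gene IDs
core_genes <- strsplit(core_enrichment, "/", fixed = TRUE)[[1]]

# Significant DMGs that are also core enrichment genes
shared <- intersect(sig_dmgs, core_genes)
print(shared)

# Percent of this pathway's core enrichment genes that are significant DMGs
pct_shared <- length(shared) / length(core_genes) * 100
print(pct_shared)
```

The rest of the notebook does the same thing for every pathway at once.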

#### I. Load packages

In [2]:
library(tidyverse)

#### II. Load, clean, and prep both csv files
starting with **phase 1 warm vs. control**

In [4]:
# load in csv file
pathway <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/enriched_pathways/pathways_p1wc.csv')

# clean headers and columns
pathway <- pathway[,-1]

# checking dimensions
dim(pathway) # 121 pathways, 11 columns of info/metadata

head(pathway)

,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment
,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>
1,cvn00053,Ascorbate and aldarate metabolism,12,0.7513381,1.627009,0.00618654,0.2537983,0.2296218,2374,"tags=58%, list=18%, signal=48%",111124535/111103451/111124599/111127562/111112920/111115614/111103498
2,cvn00910,Nitrogen metabolism,10,0.7751841,1.611862,0.01095923,0.2537983,0.2296218,1398,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592
3,cvn00511,Other glycan degradation,37,0.5917499,1.578985,0.01320066,0.2537983,0.2296218,2221,"tags=30%, list=16%, signal=25%",111106921/111106925/111106928/111119851/111119435/111120040/111113388/111119434/111106926/111119431/111106930
4,cvn00592,alpha-Linolenic acid metabolism,11,0.7377666,1.564501,0.01887756,0.2537983,0.2296218,1602,"tags=45%, list=12%, signal=40%",111113990/111115744/111107112/111115745/111124908
5,cvn03250,Viral life cycle - HIV-1,28,0.6023539,1.533878,0.0150972,0.2537983,0.2296218,2628,"tags=46%, list=19%, signal=37%",111124701/111124696/111129825/111111579/111108190/111135084/111128997/111124977/111106750/111123417/111130886/111104027/111135329
6,cvn00380,Tryptophan metabolism,31,0.583848,1.517557,0.01807548,0.2537983,0.2296218,2374,"tags=42%, list=18%, signal=35%",111127901/111100724/111103451/111134248/111133558/111112920/111115614/111121380/111125148/111109254/111130627/111108303/111103498


I want to know whether all of the genes in my geneList end up in the core enrichment, or whether the core enrichment only includes a subset of my total genes.

- total genes in my geneList for KEGG - around 13,000 genes
- total enriched pathways - 121
- total number of significant DMGs - 189
- total number of unique genes in the core_enrichment column - **799** (from code below)

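Since 799 is far fewer than the ~13,000 genes in my geneList, the core enrichment only includes a subset of my total genes. And since 799 is more than my number of significant DMGs, I know that there are genes in the core enrichment that I did not find to be significantly differentially methylated.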

In [5]:
# Split each 'core_enrichment' string into a vector of gene IDs
gene_lists <- strsplit(pathway$core_enrichment, '/')

# Flatten the lists
all_genes <- unlist(gene_lists)

# Count the unique genes
unique_genes_count <- length(unique(all_genes))

print(paste("Number of unique genes in the entire DataFrame:", unique_genes_count))


[1] "Number of unique genes in the entire DataFrame: 799"
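A rough sketch of how the subset question could be checked directly with set operations. The three vectors here are toy stand-ins (hypothetical values and names), not the real geneList, core enrichment, or DMG IDs:

```r
# Toy stand-ins (hypothetical values) for the three ID sets discussed above
gene_list_ids <- c("1", "2", "3", "4", "5", "6")  # every gene given to the KEGG analysis
core_ids      <- c("2", "3", "4")                  # unique core enrichment genes
sig_ids       <- c("3", "6")                       # significant DMGs

# Are all core enrichment genes drawn from the gene list?
core_in_list <- all(core_ids %in% gene_list_ids)

# Core enrichment genes that are NOT significant DMGs
core_not_sig <- setdiff(core_ids, sig_ids)

# Significant DMGs that never appear in any core enrichment
sig_not_core <- setdiff(sig_ids, core_ids)

print(core_in_list)
print(core_not_sig)
print(sig_not_core)
```

With the real data, `core_ids` would be `unique(all_genes)` from the cell above.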


loading in data frame that contains significant genes info for **phase 1 warm vs. control**

In [7]:
# load data frame
genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/volcano_data/phase1_wc_genes.csv')

# KEGG uses Entrez gene IDs, which are my gene IDs without the 'LOC' prefix, so strip the first three characters
genes$X <- substr(genes$X, start = 4, stop = nchar(genes$X))

# only grabbing the columns I care about
genes <- select(genes, X, log2FoldChange, padj)

# renaming columns to make more sense
colnames(genes) <- c('gene', 'lfc', 'padj')

# only selecting significant genes
genes <- filter(genes, genes$padj < 0.05)

# checking dimensions
dim(genes) # 344 sig DMGs

head(genes)

,gene,lfc,padj
,<chr>,<dbl>,<dbl>
1,111117672,1.314266,0.005811498
2,111128103,1.644558,0.001509994
3,111137770,2.655903,0.002201055
4,111125333,1.461069,0.017320572
5,111111295,2.36024,0.001233208
6,111125391,2.205531,0.001875643
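The `substr()` call above assumes every ID starts with exactly three prefix characters. As a hedged alternative sketch (the example IDs are for illustration, and the unprefixed one is hypothetical), `sub()` with an anchored pattern removes `LOC` only where it is actually present:

```r
# Example IDs; the last one has no prefix (hypothetical case)
ids <- c("LOC111117672", "LOC111128103", "111137770")

# Remove a leading "LOC" only where it is present
clean_ids <- sub("^LOC", "", ids)
print(clean_ids)

# substr() from position 4 would instead truncate the unprefixed ID
truncated <- substr(ids, 4, nchar(ids))
print(truncated)
```

If every ID in the file really does carry the prefix, both approaches give the same result.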


#### III. Are the significant genes in the core enrichment of the pathways?
Taking our list of significant genes and going row by row through our pathways to see whether any significant gene matches a core enrichment gene.

The code below adds a column to my data frame that contains the number of significant DMGs that match the core enrichment for each pathway.

In [8]:
# generated from ChatGPT

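# df1 = significant DMGs, df2 = enriched pathways
df1 <- genes
df2 <- pathway

# Count how many core enrichment genes in one row of df2 are also in df1
get_gene_matches <- function(row_df2, df1_genes) {
  # split the '/'-separated core_enrichment string into individual gene IDs
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  return(count)
}

# Apply the function to the core_enrichment string of each pathway
# (USE.NAMES = FALSE keeps the long gene strings from becoming element names)
matches_count <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene, USE.NAMES = FALSE)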

# Add the matches count to df2
df2$MatchesCount <- matches_count

# Sort df with highest match counts at the top
gene_pathway_match <- df2[order(-df2$MatchesCount),]
head(gene_pathway_match)


,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesCount
,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<int>
9,cvn04814,Motor proteins,111,0.4825362,1.425442,0.007216891,0.2537983,0.2296218,1179,"tags=18%, list=9%, signal=17%",111136151/111134768/111107338/111102596/111112439/111103394/111115784/111129526/111107250/111127380/111134843/111137068/111131563/111119946/111134888/111120500/111129376/111130940/111125250/111131555,8
13,cvn04144,Endocytosis,132,0.4453777,1.329073,0.027355623,0.2546177,0.2303631,3185,"tags=33%, list=24%, signal=25%",111120187/111125099/111112319/111119513/111112439/111119512/111136896/111104852/111102907/111107174/111136866/111125956/111134954/111112863/111116971/111112700/111135084/111133388/111134171/111121253/111115795/111127289/111129312/111134242/111121335/111121437/111129503/111135594/111106223/111123210/111125223/111105462/111104835/111104585/111104028/111133563/111105923/111119177/111101822/111104196/111137900/111112119/111123772,8
8,cvn00071,Fatty acid degradation,38,0.5657277,1.510768,0.017560053,0.2537983,0.2296218,3018,"tags=42%, list=22%, signal=33%",111127947/111113990/111128659/111115744/111107112/111115745/111103451/111112920/111115614/111135553/111121380/111117093/111103498/111129333/111107779/111134458,5
45,cvn04148,Efferocytosis,98,0.4000467,1.167589,0.194501018,0.6325836,0.5723244,1406,"tags=14%, list=10%, signal=13%",111135946/111113124/111121723/111134726/111117970/111110712/111134431/111128659/111117731/111134947/111130916/111127576/111130408/111112330,5
46,cvn03040,Spliceosome,92,0.3988463,1.153711,0.216049383,0.6325836,0.5723244,1557,"tags=16%, list=11%, signal=15%",111112733/111137770/111119513/111119512/111129112/111133954/111121854/111121021/111118318/111119442/111134531/111114893/111136440/111135640/111136164,5
4,cvn00592,alpha-Linolenic acid metabolism,11,0.7377666,1.564501,0.018877557,0.2537983,0.2296218,1602,"tags=45%, list=12%, signal=40%",111113990/111115744/111107112/111115745/111124908,4
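The count says how many genes match but not which ones. One possible way to see the matching genes themselves is to reshape the pathways into one row per pathway-gene pair and merge with the gene table. This is only a sketch: the two pathway rows are copied from the output above, while `genes_toy` (including which genes are significant and their `lfc` values) is made up.

```r
# Two pathway rows from the output above, plus a made-up gene table
pathway_toy <- data.frame(
  ID = c("cvn00592", "cvn00910"),
  core_enrichment = c("111113990/111115744/111107112/111115745/111124908",
                      "111134700/111100398/111100399/111126492/111135592"),
  stringsAsFactors = FALSE
)
genes_toy <- data.frame(
  gene = c("111113990", "111124908", "111100398"),
  lfc  = c(1.2, -0.8, 2.1),  # hypothetical values
  stringsAsFactors = FALSE
)

# One row per pathway-gene pair
core_list <- strsplit(pathway_toy$core_enrichment, "/", fixed = TRUE)
long <- data.frame(
  ID   = rep(pathway_toy$ID, lengths(core_list)),
  gene = unlist(core_list),
  stringsAsFactors = FALSE
)

# Keep only the pairs whose gene is in the gene table, carrying lfc along
matched <- merge(long, genes_toy, by = "gene")
print(matched)

# Matches per pathway (pathways with no matches do not appear here)
per_pathway <- table(matched$ID)
print(per_pathway)
```

Having `lfc` next to each matched gene also shows whether the matched genes in a pathway are mostly hyper- or hypomethylated.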


The output above gives just the ***number*** of significant genes in the core enrichment for each pathway.

The code below gives the ***percent*** of each pathway's core enrichment genes that are significant DMGs.

It's interesting that the percent gives a very different top 5 ... maybe this is the better way to do it?

In [11]:
# df1 = significant DMGs, df2 = enriched pathways
df1 <- genes
df2 <- pathway

# Percent of the core enrichment genes in one row of df2 that are also in df1
get_gene_matches <- function(row_df2, df1_genes) {
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  # Calculate the percentage of matched genes in the row
  percentage <- count / length(genes2) * 100
  return(percentage)
}

# Apply the function to the core_enrichment string of each pathway
matches_percentage <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene, USE.NAMES = FALSE)

# Add the matches percentage to df2
df2$MatchesPercentage <- matches_percentage

# Sort df with highest match percentages at the top
gene_pathway_match <- df2[order(-df2$MatchesPercentage),]
head(gene_pathway_match, 10)


,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesPercentage
,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<dbl>
4,cvn00592,alpha-Linolenic acid metabolism,11,0.7377666,1.5645014,0.018877557,0.2537983,0.2296218,1602,"tags=45%, list=12%, signal=40%",111113990/111115744/111107112/111115745/111124908,80.0
15,cvn01040,Biosynthesis of unsaturated fatty acids,18,0.6354291,1.4894918,0.033088235,0.2669118,0.2414861,1602,"tags=33%, list=12%, signal=29%",111113990/111115744/111107112/111115745/111119293/111124908,66.66667
2,cvn00910,Nitrogen metabolism,10,0.7751841,1.6118619,0.010959229,0.2537983,0.2296218,1398,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592,60.0
32,cvn04512,ECM-receptor interaction,23,0.511361,1.273773,0.176331361,0.6096027,0.5515327,1679,"tags=26%, list=12%, signal=23%",111114114/111111291/111113265/111134722/111109289/111119175,50.0
71,cvn03083,Polycomb repressive complex,38,0.3891222,1.0391458,0.438264739,0.7377131,0.6674394,865,"tags=11%, list=6%, signal=10%",111123912/111104330/111104315/111123056,50.0
87,cvn00532,Glycosaminoglycan biosynthesis - chondroitin sulfate / dermatan sulfate,12,0.4314562,0.9343104,0.569148936,0.791575,0.7161705,1400,"tags=17%, list=10%, signal=15%",111118293/111111504,50.0
94,cvn03430,Mismatch repair,21,0.355311,0.8641842,0.647272727,0.8331915,0.7538227,330,"tags=10%, list=2%, signal=9%",111121171/111137687,50.0
9,cvn04814,Motor proteins,111,0.4825362,1.425442,0.007216891,0.2537983,0.2296218,1179,"tags=18%, list=9%, signal=17%",111136151/111134768/111107338/111102596/111112439/111103394/111115784/111129526/111107250/111127380/111134843/111137068/111131563/111119946/111134888/111120500/111129376/111130940/111125250/111131555,40.0
45,cvn04148,Efferocytosis,98,0.4000467,1.1675889,0.194501018,0.6325836,0.5723244,1406,"tags=14%, list=10%, signal=13%",111135946/111113124/111121723/111134726/111117970/111110712/111134431/111128659/111117731/111134947/111130916/111127576/111130408/111112330,35.71429
22,cvn01212,Fatty acid metabolism,48,0.4857861,1.3384466,0.079569892,0.4186068,0.3787308,2029,"tags=25%, list=15%, signal=21%",111127947/111113990/111128659/111115744/111107112/111115745/111099995/111119293/111135553/111124908/111100523/111117093,33.33333
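The count and percent rankings differ because the percent divides by the size of the core enrichment, so small pathways rise to the top. In the outputs above, alpha-Linolenic acid metabolism has 4 matches out of 5 core genes (80%), while Motor proteins has 8 matches out of 20 (40%). A sketch that returns both numbers in one pass, so they can be compared side by side (`core_match_stats` is a hypothetical helper name and the toy IDs are made up to reproduce that 4-of-5 vs. 8-of-20 arithmetic):

```r
# Return the core enrichment size, match count, and match percent for each pathway
core_match_stats <- function(core_enrichment, sig_genes) {
  core_list <- strsplit(as.character(core_enrichment), "/", fixed = TRUE)
  n_core    <- lengths(core_list)
  n_match   <- vapply(core_list, function(g) sum(g %in% sig_genes), integer(1))
  data.frame(CoreSize = n_core,
             MatchesCount = n_match,
             MatchesPercentage = n_match / n_core * 100)
}

# Toy input: a 5-gene and a 20-gene core enrichment (IDs are made up)
toy_core <- c(paste(1:5, collapse = "/"), paste(101:120, collapse = "/"))
toy_sig  <- as.character(c(1:4, 101:108))

stats <- core_match_stats(toy_core, toy_sig)
print(stats)
```

Keeping `CoreSize` in the table makes it easier to judge whether a high percent comes from a very small core enrichment.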


Now that I have a df with the percent of matches between core enrichment genes and significant DMGs, I only want to look at pathways with matches (filtering out any pathways that did not contain sig. DMGs in their core enrichment).

In [14]:
# only want to look at pathways that have significant genes in their core enrichment
matched_pathways <- filter(gene_pathway_match, gene_pathway_match$MatchesPercentage != 0)

# checking dimensions to see how many pathways we have 
dim(matched_pathways) # 77 pathways with at least one match

# looking at df
head(matched_pathways)

,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesPercentage
,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<dbl>
1,cvn00592,alpha-Linolenic acid metabolism,11,0.7377666,1.5645014,0.01887756,0.2537983,0.2296218,1602,"tags=45%, list=12%, signal=40%",111113990/111115744/111107112/111115745/111124908,80.0
2,cvn01040,Biosynthesis of unsaturated fatty acids,18,0.6354291,1.4894918,0.03308824,0.2669118,0.2414861,1602,"tags=33%, list=12%, signal=29%",111113990/111115744/111107112/111115745/111119293/111124908,66.66667
3,cvn00910,Nitrogen metabolism,10,0.7751841,1.6118619,0.01095923,0.2537983,0.2296218,1398,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592,60.0
4,cvn04512,ECM-receptor interaction,23,0.511361,1.273773,0.17633136,0.6096027,0.5515327,1679,"tags=26%, list=12%, signal=23%",111114114/111111291/111113265/111134722/111109289/111119175,50.0
5,cvn03083,Polycomb repressive complex,38,0.3891222,1.0391458,0.43826474,0.7377131,0.6674394,865,"tags=11%, list=6%, signal=10%",111123912/111104330/111104315/111123056,50.0
6,cvn00532,Glycosaminoglycan biosynthesis - chondroitin sulfate / dermatan sulfate,12,0.4314562,0.9343104,0.56914894,0.791575,0.7161705,1400,"tags=17%, list=10%, signal=15%",111118293/111111504,50.0


In [11]:
write.csv(matched_pathways, '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_kegg_pathways/p1wc_siggene_pathways.csv')

In [17]:
mean(matched_pathways$MatchesPercentage)
median(matched_pathways$MatchesPercentage)
sd(matched_pathways$MatchesPercentage)

**Stats on Matched Percentage**

25 pathways with only 1 match

44 pathways with 0 matches (121 pathways total minus the 77 with at least one match)

- mean % of matches: 21%
- median % of matches: 18%
- standard deviation of matches: 14%

**alpha-Linolenic acid metabolism** has the highest percent of significant DMGs in its core enrichment, at 80%
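A small sketch of how the 0-match and 1-match tallies could be pulled straight from a match-count column, so they come from the same table as the pathway total. The vector here is hypothetical; with the real data it would be the `MatchesCount` column before filtering.

```r
# Hypothetical MatchesCount values for seven pathways
matches_count <- c(0, 0, 1, 1, 1, 3, 8)

# Number of pathways at each match count
print(table(matches_count))

n_zero <- sum(matches_count == 0)  # pathways with no significant DMGs in the core enrichment
n_one  <- sum(matches_count == 1)  # pathways with exactly one
n_any  <- sum(matches_count > 0)   # pathways with at least one

# The zero and non-zero groups should add up to the number of pathways
stopifnot(n_zero + n_any == length(matches_count))
```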

## Investigating adjusted p-values with NA values
We're trying to figure out *why* we get NA values so we can decide if we want to exclude or keep those genes in our pathway analysis

Starting with looking at our list of genes with stat info from DESeq (normal lfcShrink and lfcThreshold=0.5)

In [None]:
unfilter_genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/lfc_sig_genes/unfiltered_phase1_wc_genes.csv')
head(unfilter_genes)

filtering df to only include genes with NA for adjusted p-value

In [None]:
genes_w_na <- unfilter_genes[is.na(unfilter_genes$padj),]
dim(genes_w_na) # 5,776 genes with NA for adjusted p-value

looking more into the stats of those genes with NA for adjusted p-value, specifically looking at the baseMean (since this is what the DESeq documentation points at)
> baseMean - the average of the normalized count values, dividing by size factors, taken over all samples

In [None]:
mean(genes_w_na$baseMean) # average = 4.5 counts per sample
sd(genes_w_na$baseMean) # standard deviation = 22
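As far as the DESeq2 vignette describes it, there are three situations that produce NA: all counts zero (baseMean of 0; p-value and adjusted p-value both NA), a count outlier flagged by Cook's distance (p-value and adjusted p-value both NA), and independent filtering for low mean count (only the adjusted p-value is NA). A sketch of sorting genes into those groups, assuming the results file has `baseMean`, `pvalue`, and `padj` columns; the toy values and the `na_reason` column name are made up:

```r
# Toy results table with DESeq2-style column names (values are made up)
res_toy <- data.frame(
  gene     = c("a", "b", "c", "d"),
  baseMean = c(0, 3.1, 250, 480),
  pvalue   = c(NA, 0.20, NA, 0.001),
  padj     = c(NA, NA, NA, 0.01)
)

# Label each gene by the likely reason its padj is NA
res_toy$na_reason <- with(res_toy, ifelse(
  !is.na(padj), "not NA",
  ifelse(baseMean == 0, "all counts zero",
  ifelse(is.na(pvalue), "count outlier (Cook's distance)",
         "independent filtering (low mean count)"))))

print(table(res_toy$na_reason))
```

A low average baseMean among the NA genes, as seen above, would be consistent with most of them falling in the independent filtering group.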

loading in counts matrix that was generated with featureCounts to pull out the genes with NA for adjusted p-value

In [None]:
# loading in counts matrix
counts_matrix <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/counts_and_meta/featureCounts_meta.csv')
head(counts_matrix)

In [None]:
# setting genes as row names
rownames(counts_matrix) <- counts_matrix$X

# removing gene column (since genes are now the row names)
counts_matrix2 <- counts_matrix[,-1]
head(counts_matrix2)

In [None]:
# replace the '.' with '-'
cleaned_column_names2 <- gsub('\\.', "-", colnames(counts_matrix2))
head(cleaned_column_names2)

# now assigning to the columns
colnames(counts_matrix2) = cleaned_column_names2
head(counts_matrix2)
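The '.' characters appear because `read.csv()` rewrites column names that are not valid R names. A sketch of an alternative that keeps the header as written and sets the gene column as row names in one step (the CSV text and sample IDs below are hypothetical):

```r
# Toy CSV text with hyphenated sample names (hypothetical sample IDs)
csv_text <- "X,G01-1,W02-1\nLOC111,5,0\nLOC222,12,7"

# Default behaviour: '-' in a column name becomes '.'
default_names <- colnames(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written;
# row.names = 1 uses the first column (gene IDs) as row names
counts_toy <- read.csv(text = csv_text, check.names = FALSE, row.names = 1)

print(default_names)
print(colnames(counts_toy))
print(rownames(counts_toy))
```

This avoids the `gsub()` step, which would also change any '.' that was part of an original sample name.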

In [None]:
# creating new df of counts matrix of genes with NA for padj
counts_na <- counts_matrix2[rownames(counts_matrix2) %in% genes_w_na$X,]
dim(counts_na) # 5,776 genes 
head(counts_na)

also need to load in meta data so that I can pull out the right columns aka samples that were warm or control for phase 1

In [None]:
meta <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/counts_and_meta/CV_CE18_meta.csv')
head(meta)

pulling out only the samples that were either warm or control for phase 1 treatment (we're ignoring the effects of phase 2 for this analysis)

In [None]:
p1_wc_meta <- filter(meta, meta$Phase1 == 'warm' | meta$Phase1 == 'control')
dim(p1_wc_meta) # 15 total samples
head(p1_wc_meta)

In [None]:
p1_wc_meta

In [None]:
# only looking at samples with control or warm for phase 1 in the counts matrix
filtered_df <- counts_na[,colnames(counts_na) %in% p1_wc_meta$unique_ID]
dim(filtered_df) # still seeing 5,776 genes with NA for padj and info for 15 samples
head(filtered_df,20)

all of the **W##** samples are **control** and all of the **G##** samples are **warm** for phase 1

looking at this df in excel to more easily look at why we might get NA for those genes

In [None]:
write.csv(filtered_df, '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/featureCounts_analysis/na_p1_wc_countsmatrix.csv')
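As a possible complement to scanning the matrix in Excel, a per-gene summary can flag the usual suspects (many zeros, or one very large value). This is a sketch on made-up counts; it assumes control samples start with "W" and warm samples with "G", as noted above.

```r
# Toy counts for three genes across four samples (made-up values)
counts_toy <- data.frame(
  W01 = c(0, 2, 900), W02 = c(0, 1, 3),
  G01 = c(1, 0, 2),   G02 = c(0, 3, 4),
  row.names = c("geneA", "geneB", "geneC")
)

# TRUE for control (W) samples, FALSE for warm (G) samples
is_control <- startsWith(colnames(counts_toy), "W")

summary_toy <- data.frame(
  mean_control = rowMeans(counts_toy[, is_control, drop = FALSE]),
  mean_warm    = rowMeans(counts_toy[, !is_control, drop = FALSE]),
  n_zero       = rowSums(counts_toy == 0),   # samples with no reads
  max_count    = apply(counts_toy, 1, max),  # one very large value suggests an outlier
  row.names    = rownames(counts_toy)
)
print(summary_toy)
```

In this toy table, geneA looks like a low-count gene and geneC looks like it has a single outlier sample.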

# Unfiltered DESeq DF and Pathway
Still looking at significantly differentially methylated genes for phase 1 warm vs. control, but now a gene is assigned NA for its adjusted p-value only when it has low counts. Genes with count outliers are therefore kept instead of being assigned NA, so a different set of genes goes into the enriched pathway analysis.

In [3]:
# load in csv file
pathway <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/enriched_pathways/unfiltered_pathways_p1wc.csv')

# clean headers and columns
pathway <- pathway[,-1]

# checking dimensions
dim(pathway) # 121 pathways, 11 columns of info/metadata

head(pathway)

,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment
,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>
1,cvn00053,Ascorbate and aldarate metabolism,12,0.7513381,1.647357,0.005464596,0.191996,0.1703662,2374,"tags=58%, list=18%, signal=48%",111124535/111103451/111124599/111127562/111112920/111115614/111103498
2,cvn00910,Nitrogen metabolism,10,0.7751841,1.635015,0.005504478,0.191996,0.1703662,1398,"tags=50%, list=10%, signal=45%",111134700/111100398/111100399/111126492/111135592
3,cvn00511,Other glycan degradation,37,0.5917499,1.592294,0.011186436,0.191996,0.1703662,2221,"tags=30%, list=16%, signal=25%",111106921/111106925/111106928/111119851/111119435/111120040/111113388/111119434/111106926/111119431/111106930
4,cvn00052,Galactose metabolism,21,0.6484812,1.587363,0.01107208,0.191996,0.1703662,2261,"tags=38%, list=17%, signal=32%",111101197/111118471/111101820/111113388/111109442/111099882/111120703/111118006
5,cvn00592,alpha-Linolenic acid metabolism,11,0.7377666,1.578957,0.017454181,0.191996,0.1703662,1602,"tags=45%, list=12%, signal=40%",111113990/111115744/111107112/111115745/111124908
6,cvn03250,Viral life cycle - HIV-1,28,0.6023539,1.547657,0.017181028,0.191996,0.1703662,2628,"tags=46%, list=19%, signal=37%",111124701/111124696/111129825/111111579/111108190/111135084/111128997/111124977/111106750/111123417/111130886/111104027/111135329


In [None]:
# load data frame
genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/unfiltered_lfc_p1wc.csv')

# KEGG uses Entrez gene IDs, which are my gene IDs without the 'LOC' prefix, so strip the first three characters
genes$X <- substr(genes$X, start = 4, stop = nchar(genes$X))

# only grabbing the columns I care about
genes <- select(genes, X, log2FoldChange, padj)

# renaming columns to make more sense
colnames(genes) <- c('gene', 'lfc', 'padj')

# only selecting genes that have padj < 0.05
genes <- filter(genes, genes$padj < 0.05)

# checking dimensions
dim(genes) # 189 sig DMGs

head(genes)

#### III. Are the significant genes in the core enrichment of the pathways?
Taking our list of significant genes and going row by row through our pathways to see whether any significant gene matches a core enrichment gene.

In [None]:
# generated from ChatGPT

# df1 = significant DMGs, df2 = enriched pathways
df1 <- genes
df2 <- pathway

# Count how many core enrichment genes in one row of df2 are also in df1
get_gene_matches <- function(row_df2, df1_genes) {
  # split the '/'-separated core_enrichment string into individual gene IDs
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  return(count)
}

# Apply the function to the core_enrichment string of each pathway
matches_count <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene, USE.NAMES = FALSE)

# Add the matches count to df2
df2$MatchesCount <- matches_count

# Sort df with highest match counts at the top
gene_pathway_match <- df2[order(-df2$MatchesCount),]
head(gene_pathway_match)


Now that I have a df with the number of matches between core enrichment genes and significant DMGs, I only want to look at pathways with matches (filtering out any pathways that did not contain sig. DMGs in their core enrichment).

In [None]:
# only want to look at pathways that have significant genes in their core enrichment
matched_pathways <- filter(gene_pathway_match, gene_pathway_match$MatchesCount != 0)

# checking dimensions to see how many pathways we have 
dim(matched_pathways) # 41 pathways with at least one match

# looking at df
head(matched_pathways)

In [None]:
mean(matched_pathways$MatchesCount)
median(matched_pathways$MatchesCount)
sd(matched_pathways$MatchesCount)

**Stats on Matched Counts**

25 pathways with only 1 match

80 pathways with 0 matches (121 pathways total minus the 41 with at least one match)

- mean number of matches: 1.8
- median number of matches: 1
- standard deviation of matches: 1.31

**Motor proteins** has the highest number of significant DMGs in its core enrichment, with 7 genes

**these results are the *same* as without filtering the outlier counts**
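One way to check pathway by pathway whether two runs agree is to merge their match counts on the KEGG ID and look for differences. This is only a sketch: the two small tables and their counts are hypothetical, and `filtered` / `unfiltered` are made-up names for the two runs.

```r
# Hypothetical match counts from the two runs, keyed by KEGG pathway ID
filtered   <- data.frame(ID = c("cvn04814", "cvn04144", "cvn00071"),
                         MatchesCount = c(7, 6, 4))
unfiltered <- data.frame(ID = c("cvn04144", "cvn04814", "cvn00071"),
                         MatchesCount = c(6, 7, 4))

# Line the two runs up by pathway ID (row order in each table does not matter)
both <- merge(filtered, unfiltered, by = "ID",
              suffixes = c("_filtered", "_unfiltered"))

# Pathways whose counts differ between the runs (zero rows means the runs agree)
differing <- both[both$MatchesCount_filtered != both$MatchesCount_unfiltered, ]
print(nrow(differing))
```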


# phase 1 = hypoxia, phase 2 = control or hypoxia
also referred to as phase 1 carry over effects

In [15]:
# load in csv file
pathway <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/enriched_pathways/p1hyp_p2hc_pathways.csv')

# clean headers and columns
pathway <- pathway[,-1]

# checking dimensions
dim(pathway) # 121 pathways, 11 columns of info/metadata

head(pathway)

,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment
,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>
1,cvn00270,Cysteine and methionine metabolism,43,0.5220963,1.717776,0.003026136,0.3661625,0.3631363,1555,"tags=30%, list=12%, signal=27%",111137596/111122141/111135192/111129934/111136621/111122163/111110831/111130865/111100699/111133693/111106176/111111318/111116065
2,cvn00592,alpha-Linolenic acid metabolism,11,-0.6794618,-1.741109,0.009046136,0.4181536,0.4146978,2277,"tags=73%, list=17%, signal=60%",111123661/111124908/111136066/111136438/111107112/111115744/111113990/111127642
3,cvn03082,ATP-dependent chromatin remodeling,69,-0.3769411,-1.506744,0.010367445,0.4181536,0.4146978,2797,"tags=41%, list=21%, signal=32%",111120915/111120504/111136148/111135329/111132974/111118535/111130322/111119035/111099792/111114842/111105716/111134187/111123066/111128754/111130152/111125973/111118359/111108477/111129852/111128560/111120856/111120594/111127973/111127274/111114783/111128559/111105834/111133731
4,cvn01040,Biosynthesis of unsaturated fatty acids,18,-0.5761533,-1.696737,0.016592268,0.5019161,0.497768,2054,"tags=50%, list=15%, signal=42%",111124908/111136066/111136438/111131209/111107112/111115744/111113990/111119293/111129730
5,cvn00100,Steroid biosynthesis,10,0.6506083,1.480972,0.053763441,0.5681397,0.5634443,1767,"tags=30%, list=13%, signal=26%",111134862/111134947/111112479
6,cvn00510,N-Glycan biosynthesis,33,-0.4283362,-1.478479,0.061039803,0.5681397,0.5634443,3156,"tags=45%, list=24%, signal=35%",111113415/111118581/111119558/111101820/111125632/111124588/111136555/111101197/111122131/111124498/111134828/111121994/111137033/111136571/111126213


loading in data frame that contains significant genes info for **phase 1 hypoxic, phase 2 hypoxic or control**

In [16]:
# load data frame
genes <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/volcano_data/phase1_ce_genes.csv')

# only want data for samples that were hypoxic for phase 1
genes_h <- filter(genes, genes$phase1 == 'hypoxic')

dim(genes_h)
head(genes_h)

,X,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,phase1,more_me_in
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
1,LOC1111094521,3.338687,0.3118057,0.2956137,0.2090759,0.834389,1.0,hypoxic,not significant
2,LOC1111248021,316.608142,-0.07839417,0.1033729,0.0,1.0,1.0,hypoxic,not significant
3,LOC1111012731,107.996235,-0.05650996,0.124215,0.0,1.0,1.0,hypoxic,not significant
4,LOC1111012501,171.639752,-0.04906065,0.1585186,0.0,1.0,1.0,hypoxic,not significant
5,LOC1111012621,399.065246,-0.07049632,0.1311921,0.0,1.0,1.0,hypoxic,not significant
6,LOC1111332601,30.078658,1.72346224,0.2533528,5.8158519,6.032586e-09,3.591274e-06,hypoxic,hypoxic


In [17]:
# KEGG uses Entrez gene IDs, which are my gene IDs without the 'LOC' prefix, so strip the first three characters
genes_h$X <- substr(genes_h$X, start = 4, stop = nchar(genes_h$X))

# only grabbing the columns I care about
genes_h <- select(genes_h, X, log2FoldChange, padj)

# renaming columns to make more sense
colnames(genes_h) <- c('gene', 'lfc', 'padj')

# only selecting significant genes
genes_h <- filter(genes_h, genes_h$padj < 0.05)

# checking dimensions
dim(genes_h) # 231 sig DMGs

head(genes_h)

,gene,lfc,padj
,<chr>,<dbl>,<dbl>
1,1111332601,1.723462,3.591274e-06
2,1111098091,1.986564,0.001406264
3,1111041531,1.441954,0.00354706
4,1111283141,1.404023,0.007499595
5,1111287841,1.968813,0.001655392
6,1111130221,1.254941,0.02726499


#### III. Are the significant genes in the core enrichment of the pathways?
Taking our list of significant genes and going row by row through our pathways to see whether any significant gene matches a core enrichment gene.

The code below adds a column to my data frame that contains the number of significant DMGs that match the core enrichment for each pathway.

In [19]:
# generated from ChatGPT

# df1 = significant DMGs (phase 1 hypoxic), df2 = enriched pathways
df1 <- genes_h
df2 <- pathway

# Count how many core enrichment genes in one row of df2 are also in df1
get_gene_matches <- function(row_df2, df1_genes) {
  # split the '/'-separated core_enrichment string into individual gene IDs
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  return(count)
}

# Apply the function to the core_enrichment string of each pathway
matches_count <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene, USE.NAMES = FALSE)

# Add the matches count to df2
df2$MatchesCount <- matches_count

# Sort df with highest match counts at the top
gene_pathway_match <- df2[order(-df2$MatchesCount),]
head(gene_pathway_match)


,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesCount
,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<int>
1,cvn00270,Cysteine and methionine metabolism,43,0.5220963,1.717776,0.003026136,0.3661625,0.3631363,1555,"tags=30%, list=12%, signal=27%",111137596/111122141/111135192/111129934/111136621/111122163/111110831/111130865/111100699/111133693/111106176/111111318/111116065,0
2,cvn00592,alpha-Linolenic acid metabolism,11,-0.6794618,-1.741109,0.009046136,0.4181536,0.4146978,2277,"tags=73%, list=17%, signal=60%",111123661/111124908/111136066/111136438/111107112/111115744/111113990/111127642,0
3,cvn03082,ATP-dependent chromatin remodeling,69,-0.3769411,-1.506744,0.010367445,0.4181536,0.4146978,2797,"tags=41%, list=21%, signal=32%",111120915/111120504/111136148/111135329/111132974/111118535/111130322/111119035/111099792/111114842/111105716/111134187/111123066/111128754/111130152/111125973/111118359/111108477/111129852/111128560/111120856/111120594/111127973/111127274/111114783/111128559/111105834/111133731,0
4,cvn01040,Biosynthesis of unsaturated fatty acids,18,-0.5761533,-1.696737,0.016592268,0.5019161,0.497768,2054,"tags=50%, list=15%, signal=42%",111124908/111136066/111136438/111131209/111107112/111115744/111113990/111119293/111129730,0
5,cvn00100,Steroid biosynthesis,10,0.6506083,1.480972,0.053763441,0.5681397,0.5634443,1767,"tags=30%, list=13%, signal=26%",111134862/111134947/111112479,0
6,cvn00510,N-Glycan biosynthesis,33,-0.4283362,-1.478479,0.061039803,0.5681397,0.5634443,3156,"tags=45%, list=24%, signal=35%",111113415/111118581/111119558/111101820/111125632/111124588/111136555/111101197/111122131/111124498/111134828/111121994/111137033/111136571/111126213,0
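The table is sorted with the highest count first and the top row is 0, so no pathway has any match in this comparison. One thing that may be worth checking is the ID format: the gene IDs printed above after removing 'LOC' have 10 digits (e.g. 1111332601), while the core enrichment IDs have 9 (e.g. 111137596). It is not clear from this output where the extra trailing digit comes from. A quick sketch of the check, using IDs copied from the two outputs above:

```r
# IDs copied from the two outputs above
sig_ids  <- c("1111332601", "1111098091", "1111041531")  # from genes_h$gene
core_ids <- c("111137596", "111122141", "111135192")     # from core_enrichment

# Compare ID lengths between the two sources
print(table(nchar(sig_ids)))
print(table(nchar(core_ids)))

# Number of IDs the two sets have in common
n_shared <- length(intersect(sig_ids, core_ids))
print(n_shared)
```

If the lengths differ across the full data too, exact matching with `%in%` cannot find any overlap until the IDs are put in the same format.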


The output above gives just the ***number*** of significant genes in the core enrichment for each pathway.

The code below gives the ***percent*** of each pathway's core enrichment genes that are significant DMGs.

Note: for this comparison the sorted output above tops out at 0 matches, so every pathway has 0 matches and the percent will also be 0 for every pathway.

In [20]:
# df1 = significant DMGs (phase 1 hypoxic), df2 = enriched pathways
df1 <- genes_h
df2 <- pathway

# Percent of the core enrichment genes in one row of df2 that are also in df1
get_gene_matches <- function(row_df2, df1_genes) {
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  # Calculate the percentage of matched genes in the row
  percentage <- count / length(genes2) * 100
  return(percentage)
}

# Apply the function to the core_enrichment string of each pathway
matches_percentage <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene, USE.NAMES = FALSE)

# Add the matches percentage to df2
df2$MatchesPercentage <- matches_percentage

# Sort df with highest match percentages at the top
gene_pathway_match <- df2[order(-df2$MatchesPercentage),]
head(gene_pathway_match, 10)


Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesPercentage
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<dbl>
1,cvn00270,Cysteine and methionine metabolism,43,0.5220963,1.717776,0.003026136,0.3661625,0.3631363,1555,"tags=30%, list=12%, signal=27%",111137596/111122141/111135192/111129934/111136621/111122163/111110831/111130865/111100699/111133693/111106176/111111318/111116065,0
2,cvn00592,alpha-Linolenic acid metabolism,11,-0.6794618,-1.741109,0.009046136,0.4181536,0.4146978,2277,"tags=73%, list=17%, signal=60%",111123661/111124908/111136066/111136438/111107112/111115744/111113990/111127642,0
3,cvn03082,ATP-dependent chromatin remodeling,69,-0.3769411,-1.506744,0.010367445,0.4181536,0.4146978,2797,"tags=41%, list=21%, signal=32%",111120915/111120504/111136148/111135329/111132974/111118535/111130322/111119035/111099792/111114842/111105716/111134187/111123066/111128754/111130152/111125973/111118359/111108477/111129852/111128560/111120856/111120594/111127973/111127274/111114783/111128559/111105834/111133731,0
4,cvn01040,Biosynthesis of unsaturated fatty acids,18,-0.5761533,-1.696737,0.016592268,0.5019161,0.497768,2054,"tags=50%, list=15%, signal=42%",111124908/111136066/111136438/111131209/111107112/111115744/111113990/111119293/111129730,0
5,cvn00100,Steroid biosynthesis,10,0.6506083,1.480972,0.053763441,0.5681397,0.5634443,1767,"tags=30%, list=13%, signal=26%",111134862/111134947/111112479,0
6,cvn00510,N-Glycan biosynthesis,33,-0.4283362,-1.478479,0.061039803,0.5681397,0.5634443,3156,"tags=45%, list=24%, signal=35%",111113415/111118581/111119558/111101820/111125632/111124588/111136555/111101197/111122131/111124498/111134828/111121994/111137033/111136571/111126213,0
7,cvn00860,Porphyrin metabolism,18,0.538562,1.464175,0.05704698,0.5681397,0.5634443,324,"tags=22%, list=2%, signal=22%",111111231/111120634/111104407/111117758,0
8,cvn03460,Fanconi anemia pathway,41,0.4485237,1.453826,0.0382263,0.5681397,0.5634443,2809,"tags=34%, list=21%, signal=27%",111122179/111128124/111126926/111136867/111131048/111119232/111130580/111118905/111113980/111131835/111133998/111108639/111119126/111102631,0
9,cvn00514,Other types of O-glycan biosynthesis,29,0.4717735,1.436501,0.057416268,0.5681397,0.5634443,1855,"tags=34%, list=14%, signal=30%",111128050/111129166/111112794/111125659/111126140/111118200/111136634/111129961/111110946/111122340,0
10,cvn04120,Ubiquitin mediated proteolysis,117,0.3668674,1.43229,0.030278099,0.5681397,0.5634443,3046,"tags=34%, list=23%, signal=27%",111121155/111099688/111100396/111104196/111120860/111137343/111135325/111131129/111119134/111105597/111138340/111104637/111124955/111129295/111105462/111130310/111114538/111102530/111136470/111103790/111120632/111123155/111130966/111110185/111103787/111103271/111120128/111121135/111120086/111128926/111133998/111129467/111132898/111100417/111104077/111121443/111128564/111126002/111103982/111129365,0


# phase 1 = control or hypoxia, **phase 2 = control**


In [21]:
# load in csv file
pathway <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/enriched_pathways/p2cont_p1ch_pathways.csv')

# clean headers and columns
pathway <- pathway[,-1]

# checking dimensions
dim(pathway) #121 pathways, 11 columns of info/meta data

head(pathway)

Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>
1,cvn00592,alpha-Linolenic acid metabolism,11,0.781259,1.965731,0.0003842742,0.04649718,0.0461129,1762,"tags=73%, list=13%, signal=63%",111113990/111115744/111127642/111136066/111123661/111107112/111115745/111136438
2,cvn01040,Biosynthesis of unsaturated fatty acids,18,0.542455,1.582749,0.02463919,0.50432328,0.5001553,1908,"tags=44%, list=14%, signal=38%",111113990/111115744/111129730/111136066/111107112/111115745/111136438/111131209
3,cvn00310,Lysine degradation,31,-0.4896714,-1.538601,0.0259889723,0.50432328,0.5001553,1842,"tags=32%, list=14%, signal=28%",111115614/111109254/111130627/111121380/111130119/111107127/111125659/111110608/111112920/111128625
4,cvn03008,Ribosome biogenesis in eukaryotes,56,-0.4223369,-1.47521,0.0197349344,0.50432328,0.5001553,2942,"tags=39%, list=22%, signal=31%",111104038/111122686/111123620/111128896/111134591/111103436/111102803/111105066/111119396/111123381/111112561/111110086/111119458/111125104/111128153/111132055/111119695/111121480/111128265/111128132/111120056/111133163
5,cvn04148,Efferocytosis,98,-0.3698939,-1.419869,0.0201014076,0.50432328,0.5001553,2846,"tags=32%, list=21%, signal=25%",111128693/111124840/111112952/111134431/111120235/111135946/111113319/111135761/111133023/111125427/111124014/111127575/111122108/111123664/111107163/111100224/111107779/111126115/111117732/111123084/111136548/111137094/111125144/111110874/111115463/111122163/111128744/111131845/111109809/111104335/111123492
6,cvn04068,FoxO signaling pathway,65,-0.395892,-1.419769,0.0278332376,0.50432328,0.5001553,2846,"tags=38%, list=21%, signal=30%",111128693/111121135/111125223/111121739/111131500/111118834/111112841/111103474/111126185/111134642/111119108/111107163/111105462/111130138/111120632/111121839/111113171/111121740/111119905/111112940/111134713/111120947/111102390/111128744/111104196


loading in data frame that contains significant genes info for **phase 1 control or hypoxic, phase 2 control**

In [22]:
# load data frame
genes_c <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/p2c_lfc25_genes.csv')

# KEGG uses Entrez gene IDs, which are my NCBI 'LOC' gene IDs without the 'LOC' prefix, so strip the first three characters
genes_c$X <- substr(genes_c$X, start = 4, stop = nchar(genes_c$X))

# only grabbing the columns I care about
genes_c <- select(genes_c, X, log2FoldChange, padj)

# renaming columns to make more sense
colnames(genes_c) <- c('gene', 'lfc', 'padj')

# only selecting significant genes
genes_c <- filter(genes_c, padj < 0.05)

# checking dimensions
dim(genes_c) # 231 sig DMGs

head(genes_c)

Unnamed: 0_level_0,gene,lfc,padj
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>
1,111133260,-1.400662,0.03210292
2,111130870,1.930161,0.0051637
3,111124824,-1.748396,0.02225067
4,111123492,-1.250427,0.02225067
5,111124669,-1.375254,0.02225067
6,111129146,1.352978,0.02225067
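
The `substr()` call above assumes every ID starts with exactly three prefix characters. As a minimal sketch (toy IDs that only follow the `LOC<number>` pattern, not values checked against the csv), anchoring a regular expression on the literal `LOC` prefix removes it only where it is present, so an ID that is already a bare Entrez number is left unchanged:

```r
# toy IDs for illustration only; the last one has no 'LOC' prefix
ids <- c("LOC111133260", "LOC111130870", "111124824")

# substr() drops the first three characters no matter what they are
by_position <- substr(ids, start = 4, stop = nchar(ids))

# sub() drops "LOC" only when the ID actually starts with it
by_pattern <- sub("^LOC", "", ids)

by_position
by_pattern
```

If every ID in the file carries the prefix, both approaches give the same result; the pattern version just fails more gracefully if one does not.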


#### III. Are the significant genes in the core enrichment of the pathways?
Taking our list of significant genes and going row by row through our pathways to see whether any significant gene matches any of the core enrichment genes

the code below will add a column to my dataframe that contains the number of significant DMGs that match the core enrichment for that pathway

In [23]:
# generated from ChatGPT

# df1 = significant DMGs, df2 = enriched pathways
df1 <- genes_c
df2 <- pathway

# Function returning the number of a pathway's core enrichment genes that are significant DMGs
get_gene_matches <- function(row_df2, df1_genes) {
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  return(count)
}

# Iterate over each row of df2
matches_count <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene)

# Add the matches count to df2
df2$MatchesCount <- matches_count

# Sort df with highest match counts at the top
gene_pathway_match <- df2[order(-df2$MatchesCount),]
head(gene_pathway_match)


Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesCount
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<int>
5,cvn04148,Efferocytosis,98,-0.3698939,-1.419869,0.0201014076,0.50432328,0.5001553,2846,"tags=32%, list=21%, signal=25%",111128693/111124840/111112952/111134431/111120235/111135946/111113319/111135761/111133023/111125427/111124014/111127575/111122108/111123664/111107163/111100224/111107779/111126115/111117732/111123084/111136548/111137094/111125144/111110874/111115463/111122163/111128744/111131845/111109809/111104335/111123492,1
41,cvn03010,Ribosome,82,-0.3120704,-1.168279,0.1880597015,0.6174615,0.6123585,308,"tags=13%, list=2%, signal=13%",111137581/111102380/111104169/111132903/111130094/111137089/111133074/111130039/111111438/111127769/111124824,1
1,cvn00592,alpha-Linolenic acid metabolism,11,0.781259,1.965731,0.0003842742,0.04649718,0.0461129,1762,"tags=73%, list=13%, signal=63%",111113990/111115744/111127642/111136066/111123661/111107112/111115745/111136438,0
2,cvn01040,Biosynthesis of unsaturated fatty acids,18,0.542455,1.582749,0.02463919,0.50432328,0.5001553,1908,"tags=44%, list=14%, signal=38%",111113990/111115744/111129730/111136066/111107112/111115745/111136438/111131209,0
3,cvn00310,Lysine degradation,31,-0.4896714,-1.538601,0.0259889723,0.50432328,0.5001553,1842,"tags=32%, list=14%, signal=28%",111115614/111109254/111130627/111121380/111130119/111107127/111125659/111110608/111112920/111128625,0
4,cvn03008,Ribosome biogenesis in eukaryotes,56,-0.4223369,-1.47521,0.0197349344,0.50432328,0.5001553,2942,"tags=39%, list=22%, signal=31%",111104038/111122686/111123620/111128896/111134591/111103436/111102803/111105066/111119396/111123381/111112561/111110086/111119458/111125104/111128153/111132055/111119695/111121480/111128265/111128132/111120056/111133163,0


above returns just the ***number*** of significant genes in the core enrichment for each pathway

below returns the ***percent*** of significant genes in the core enrichment

it's interesting that the percent returns a much different top 5 ... maybe this is the better way to do it?

In [25]:
# df1 = significant DMGs, df2 = enriched pathways
df1 <- genes_c
df2 <- pathway

# Function returning the percent of a pathway's core enrichment genes that are significant DMGs
get_gene_matches <- function(row_df2, df1_genes) {
  genes2 <- unlist(strsplit(as.character(row_df2), "/"))
  count <- sum(genes2 %in% df1_genes)
  # Calculate the percentage of matched genes in the row
  percentage <- count / length(genes2) * 100
  return(percentage)
}

# Iterate over each row of df2
matches_percentage <- sapply(df2$core_enrichment, get_gene_matches, df1_genes = df1$gene)

# Add the matches percentage to df2
df2$MatchesPercentage <- matches_percentage

# Sort df with highest match percentages at the top
gene_pathway_match <- df2[order(-df2$MatchesPercentage),]
head(gene_pathway_match, 10)


Unnamed: 0_level_0,ID,Description,setSize,enrichmentScore,NES,pvalue,p.adjust,qvalue,rank,leading_edge,core_enrichment,MatchesPercentage
Unnamed: 0_level_1,<chr>,<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>,<chr>,<dbl>
41,cvn03010,Ribosome,82,-0.3120704,-1.168279,0.1880597015,0.6174615,0.6123585,308,"tags=13%, list=2%, signal=13%",111137581/111102380/111104169/111132903/111130094/111137089/111133074/111130039/111111438/111127769/111124824,9.090909
5,cvn04148,Efferocytosis,98,-0.3698939,-1.419869,0.0201014076,0.50432328,0.5001553,2846,"tags=32%, list=21%, signal=25%",111128693/111124840/111112952/111134431/111120235/111135946/111113319/111135761/111133023/111125427/111124014/111127575/111122108/111123664/111107163/111100224/111107779/111126115/111117732/111123084/111136548/111137094/111125144/111110874/111115463/111122163/111128744/111131845/111109809/111104335/111123492,3.225806
1,cvn00592,alpha-Linolenic acid metabolism,11,0.781259,1.965731,0.0003842742,0.04649718,0.0461129,1762,"tags=73%, list=13%, signal=63%",111113990/111115744/111127642/111136066/111123661/111107112/111115745/111136438,0.0
2,cvn01040,Biosynthesis of unsaturated fatty acids,18,0.542455,1.582749,0.02463919,0.50432328,0.5001553,1908,"tags=44%, list=14%, signal=38%",111113990/111115744/111129730/111136066/111107112/111115745/111136438/111131209,0.0
3,cvn00310,Lysine degradation,31,-0.4896714,-1.538601,0.0259889723,0.50432328,0.5001553,1842,"tags=32%, list=14%, signal=28%",111115614/111109254/111130627/111121380/111130119/111107127/111125659/111110608/111112920/111128625,0.0
4,cvn03008,Ribosome biogenesis in eukaryotes,56,-0.4223369,-1.47521,0.0197349344,0.50432328,0.5001553,2942,"tags=39%, list=22%, signal=31%",111104038/111122686/111123620/111128896/111134591/111103436/111102803/111105066/111119396/111123381/111112561/111110086/111119458/111125104/111128153/111132055/111119695/111121480/111128265/111128132/111120056/111133163,0.0
6,cvn04068,FoxO signaling pathway,65,-0.395892,-1.419769,0.0278332376,0.50432328,0.5001553,2846,"tags=38%, list=21%, signal=30%",111128693/111121135/111125223/111121739/111131500/111118834/111112841/111103474/111126185/111134642/111119108/111107163/111105462/111130138/111120632/111121839/111113171/111121740/111119905/111112940/111134713/111120947/111102390/111128744/111104196,0.0
7,cvn04146,Peroxisome,77,0.3458777,1.383635,0.0291757268,0.50432328,0.5001553,1908,"tags=31%, list=14%, signal=27%",111135891/111113990/111115744/111104085/111107393/111130560/111135288/111136066/111117608/111136587/111107606/111107112/111107850/111107491/111115745/111110283/111135201/111132335/111136438/111121658/111135419/111128430/111108763/111131209,0.0
8,cvn04980,Cobalamin transport and metabolism,13,0.5375633,1.443707,0.0620689655,0.6174615,0.6123585,1324,"tags=23%, list=10%, signal=21%",111126313/111130560/111122749,0.0
9,cvn00591,Linoleic acid metabolism,11,0.5716292,1.43828,0.0704545455,0.6174615,0.6123585,3429,"tags=55%, list=26%, signal=41%",111127642/111123661/111127589/111127588/111121119/111111230,0.0
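
The count and the percent come from the same split-and-match step, so they can be computed together, along with which genes actually matched. The sketch below uses made-up gene IDs and a toy pathway table (`sig_genes`, `toy_pathway`, and the `MatchedGenes` column are hypothetical names, not part of the analysis above). It also shows why the two rankings disagree: the percent divides by the size of the core enrichment list, so one match in a short list outranks two matches in a long one. That is the same effect as Ribosome (1 of 11, 9.09%) moving ahead of Efferocytosis (1 of 31, 3.23%) in the output above.

```r
# toy inputs (hypothetical IDs) standing in for genes_c$gene and pathway$core_enrichment
sig_genes <- c("101", "205", "309")
toy_pathway <- data.frame(
  Description = c("pathway A", "pathway B", "pathway C"),
  core_enrichment = c("101/102",
                      "205/309/400/401/402/403/404/405/406/407",
                      "500/501"),
  stringsAsFactors = FALSE
)

# split each '/'-separated string into a character vector of gene IDs
core_list <- strsplit(toy_pathway$core_enrichment, "/", fixed = TRUE)

# keep only the core enrichment IDs that are also significant
matched <- lapply(core_list, function(g) g[g %in% sig_genes])

toy_pathway$MatchesCount <- lengths(matched)
toy_pathway$MatchesPercentage <- 100 * lengths(matched) / lengths(core_list)
toy_pathway$MatchedGenes <- vapply(matched, paste, character(1), collapse = "/")

# pathway B has the most matches, but pathway A has the highest percent
toy_pathway[order(-toy_pathway$MatchesPercentage), ]
```

Keeping the matched IDs next to the count makes it easy to look up which DMG is driving a match without re-splitting the string by hand.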


### are the sig genes in second exposure control the same ones that are sig in first exposure hypoxia?

In [35]:
genes_control <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/p2c_lfc25_genes.csv')
genes_control <- filter(genes_control, padj < 0.05)

genes_control <- select(genes_control,'X', 'log2FoldChange')
colnames(genes_control) <- c('gene', 'lfc')

genes_control

gene,lfc
<chr>,<dbl>
LOC111133260,-1.400662
LOC111130870,1.930161
LOC111124824,-1.748396
LOC111123492,-1.250427
LOC111124669,-1.375254
LOC111129146,1.352978
LOC111133892,-1.039787
LOC111133874,-1.008349
LOC111132673,1.418782
LOC111109617,-2.260938


In [34]:
genes_hypoxia <- read.csv('/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/significant_genes/p1h_lfc25_genes.csv')
genes_hypoxia <- filter(genes_hypoxia, padj < 0.05)

genes_hypoxia <- select(genes_hypoxia, 'X', 'log2FoldChange')
colnames(genes_hypoxia) <- c('gene', 'lfc')

dim(genes_hypoxia)
head(genes_hypoxia)

Unnamed: 0_level_0,gene,lfc
Unnamed: 0_level_1,<chr>,<dbl>
1,LOC111133260,1.723462
2,LOC111109809,1.986564
3,LOC111104153,1.441954
4,LOC111128314,1.404023
5,LOC111128784,1.968813
6,LOC111113022,1.254941


checking which genes that are significant DMGs for phase 2 control (phase 1 control vs. hypoxia) are also significant DMGs for phase 1 hypoxia (phase 2 control vs. hypoxia)

In [38]:
hypoxia <- genes_hypoxia[genes_hypoxia$gene %in% genes_control$gene, ]
hypoxia

Unnamed: 0_level_0,gene,lfc
Unnamed: 0_level_1,<chr>,<dbl>
1,LOC111133260,1.723462
70,LOC111124824,2.394202
75,LOC111123492,1.62809
84,LOC111124669,1.533885
123,LOC111129146,-1.600709
144,LOC111133892,1.070634


so these are the log fold changes for the shared genes in the phase 1 hypoxia comparison (phase 2 control vs. hypoxia)

> -LFC = more methylation in hypoxia control (phase 1 hypoxia, phase 2 control)

> +LFC = more methylation in hypoxia hypoxia (phase 1 hypoxia, phase 2 hypoxia)

In [40]:
control <- genes_control[genes_control$gene %in% hypoxia$gene, ]
control

Unnamed: 0_level_0,gene,lfc
Unnamed: 0_level_1,<chr>,<dbl>
1,LOC111133260,-1.400662
3,LOC111124824,-1.748396
4,LOC111123492,-1.250427
5,LOC111124669,-1.375254
6,LOC111129146,1.352978
7,LOC111133892,-1.039787


these log fold changes are for the same genes in the phase 2 control comparison (phase 1 control vs. hypoxia)

> -LFC = more methylation in control control (phase 1 control, phase 2 control)

> +LFC = more methylation in hypoxia control (phase 1 hypoxia, phase 2 control)

In [41]:
colnames(hypoxia) <- c('gene', 'hypoxia_lfc')
colnames(control) <- c('gene', 'control_lfc')

merge(hypoxia, control, by='gene')

gene,hypoxia_lfc,control_lfc
<chr>,<dbl>,<dbl>
LOC111123492,1.62809,-1.250427
LOC111124669,1.533885,-1.375254
LOC111124824,2.394202,-1.748396
LOC111129146,-1.600709,1.352978
LOC111133260,1.723462,-1.400662
LOC111133892,1.070634,-1.039787
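
In the merged table above, all six shared genes have LFCs of opposite sign in the two comparisons. A small sketch of how that could be flagged automatically, using toy tables with made-up IDs and values (`hypoxia_toy`, `control_toy`, and `opposite_direction` are hypothetical names). It also shows that `merge()` on its own keeps only genes present in both tables, since its default is an inner join:

```r
# toy tables reusing the column names from the cells above; IDs and values are illustrative
hypoxia_toy <- data.frame(gene = c("LOC1", "LOC2", "LOC3"),
                          hypoxia_lfc = c(1.6, -1.6, 1.1),
                          stringsAsFactors = FALSE)
control_toy <- data.frame(gene = c("LOC2", "LOC3", "LOC4"),
                          control_lfc = c(1.4, 0.9, -2.0),
                          stringsAsFactors = FALSE)

# merge() defaults to an inner join, so only genes found in both tables survive
shared <- merge(hypoxia_toy, control_toy, by = "gene")

# TRUE when the two comparisons disagree on the sign of the LFC
shared$opposite_direction <- sign(shared$hypoxia_lfc) != sign(shared$control_lfc)

shared
```

Whether an opposite sign means a biologically opposite change depends on which group is the reference level in each comparison, so the flag is only a starting point for interpretation.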


In [42]:
write.csv(merge(hypoxia, control, by='gene'), '/project/pi_sarah_gignouxwolfsohn_uml_edu/julia/CE_MethylRAD_analysis_2018/analysis/shared_genes.csv', row.names = FALSE)