<a href="https://colab.research.google.com/github/joanall/Hands-On-Epigenome-Wide-Analysis-EWAS-/blob/main/ewas_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Hands-on Data EWAS Analysis Tutorial**  

Joana Llauradó, Predoctoral Researcher at the Barcelona Institute for Global Health (ISGlobal).

Mariona Bustamante, Senior Research Scientist at the Barcelona Institute for Global Health (ISGlobal).


**Objective:** This hands-on exercise focuses on the analysis phase of an Epigenome-Wide Association Study (EWAS). Using preprocessed DNA methylation data, participants will learn how to identify associations between an environmental exposure, such as smoking, and methylation levels across the genome.


**Description:** During this practical, we will perform EWAS analysis and basic downstream interpretation to gain biological insight from the results. Participants will learn how to run association models, correct for multiple testing, visualize significant findings (e.g., using Manhattan and QQ plots), and interpret the top CpG sites and enriched pathways related to the exposure of interest.

**Based on/adapted from** Previous course Summer School in Global Health 2023-EWAS [https://github.com/isglobal-brge/course_methylation?tab=readme-ov-file]

**Considerations**
* The data quality control (QC) step will not be assessed in this session. However, participants can explore QC independently using the example code provided [here: insert link], which demonstrates how to perform standard preprocessing and QC steps.

* Public data from [link]. Note that this data is public and has been reviewed/selected so that the analysis can be run in a short period of time. EWAS data is usually up to XXXX GB, and a computer with enough memory or a cluster is needed to run the analysis. The time needed is also longer than the time available for this hands-on.

* Recommendations for the design and analysis of epigenome-wide association studies: https://clinicalepigeneticsjournal.biomedcentral.com/articles/10.1186/s13148-021-01200-8
* Epigenetic Signatures of Cigarette Smoking: https://www.ahajournals.org/doi/full/10.1161/CIRCGENETICS.116.001506
* Meffil: efficient normalization and analysis of very large DNA methylation datasets: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6247925/
* Orchestrating high-throughput genomic analysis with Bioconductor: https://www.nature.com/articles/nmeth.3252

**Reminder: Introduction to NoteBook**
Within this notebook, you will be guided step by step from loading a dataset to analyzing its content.

The Jupyter (Python) notebook is an approach that combines text blocks (like this one) together with code blocks or cells. The great advantage of this type of cell is its interactivity, as they can be executed to check the results directly within them. Very important: the order of instructions is fundamental, so each cell in this notebook must be executed sequentially. If any are omitted, the program may throw an error, so you should start from the beginning if in doubt.

First of all:

It is very important that, at the start, you select "Open in draft mode" at the top left. Otherwise, for security reasons, you will not be able to execute any code block. When the first block is executed, the following message will appear: "Warning: This notebook was not created by Google." Do not worry: you can trust the content of the notebook and click "Run anyway".

Let’s go!

Click the "play" button on the left side of each code cell. Lines of code that begin with a hashtag (#) are comments and do not affect the execution of the program.

You can also click on each cell and press "ctrl+enter" (cmd+enter on Mac).

Each time you run a block, you will see the output just below it. The output usually corresponds to the last instruction in the block, along with the output of any print() commands in the code.

## **INDEX**
1. [Installation of the R environment and required libraries for EWAS](#install-libraries)
2. [Load data](#load-data)
3. [Descriptive analysis](#descriptive)
4. [Epigenome wide association analysis](#association)
5. [Other tutorials for previous and further steps](#tutorials)
6. [Acknowledgement](#acknowledgement)


## **1. Installation of the R Environment and Libraries for EWAS Analysis** <a name="install-libraries"></a>

Below, we install/load the libraries necessary for this session. In the context of exposome analysis, R libraries offer us a much more convenient way to process, manipulate, and analyze the data. Some of these libraries: `tidyverse`, `name_library`  
  
The installation of R in our Google Colab environment will be carried out in the following code block. It should be remembered that all library installations we perform in the Google Colab environment will only remain active for a few hours, after which the installed libraries are removed. Therefore, it will be necessary for you to re-run the library installation code in this section whenever you need to run the notebook again after this time.

**Note:** We recommend installing the libraries **30 minutes** before the start of the session❗❗❗

In [None]:
# Estimated execution time: 7 minutes approx.
t0 <- Sys.time()

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

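# Note: meffil is distributed through GitHub, so the call below may report it as
# "not available"; in that case use the alternative command in the next cell.
packages <- c("GenomicRanges", "GEOquery", "meffil", "Biobase", "qqman",
"ggplot2", "ggrepel", "karyoploteR")

# other packages we may need: "IlluminaHumanMethylation450kanno.ilmn12.hg19", "clusterProfiler",
#                       "org.Hs.eg.db", "ReactomePA", "enrichplot", "metafor", "reshape", "plyr"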

to_install <- setdiff(packages, rownames(installed.packages()))
if (length(to_install)) {
  BiocManager::install(to_install, ask = FALSE, update = FALSE)
}

invisible(lapply(packages, require, character.only = TRUE))
elapsed_min <- as.numeric(difftime(Sys.time(), t0, units = "mins"))
cat(sprintf("Execution time: %.1f minutes\n", elapsed_min))


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com

Bioconductor version 3.21 (BiocManager 1.30.26), R 4.5.1 (2025-06-13)

Installing package(s) 'BiocVersion', 'GenomicRanges', 'GEOquery', 'meffil',
  'Biobase', 'qqman'

“package ‘meffil’ is not available for Bioconductor version '3.21'

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages”
also installing the dependencies ‘matrixStats’, ‘abind’, ‘SparseArray’, ‘UCSC.utils’, ‘GenomeInfoDbData’, ‘statmod’, ‘XML’, ‘R.oo’, ‘R.methodsS3’, ‘MatrixGenerics’, ‘S4Arrays’, ‘DelayedArray’, ‘BiocGenerics’, ‘S4Vectors’, ‘IRanges’, ‘GenomeInfoDb’, ‘XVector’, ‘limma’, ‘rentrez’, ‘R.utils’, ‘SummarizedExperim

Execution time: 7.3 minutes


If the installation of the package `meffil` throws errors, try this alternative command:



In [None]:
# install.packages("devtools")
# devtools::install_github("perishky/meffil")

### 1.1 Load libraries

Before starting, we need to load the libraries we just installed into the session. Libraries normally only need to be installed once, but they must be loaded every time a new session starts. In Google Colab, as noted above, installed libraries are removed after a few hours, so the installation must also be repeated when the environment is reset.

In [None]:
library(BiocManager)
library(GenomicRanges) # genomic intervals; used to prepare the data for the Manhattan plot
library(GEOquery)

library(Biobase)     # access and modify data in the ExpressionSet
library(meffil)      # run the EWAS
library(qqman)       # QQ plot
library(ggplot2)     # plots
library(ggrepel)
library(karyoploteR) # Manhattan plot



The documentation of any package can be queried interactively through the help system:


In [None]:
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges

# Case Study
Data: Cohort 1 (N = 294)
- Array: 450K
- Tissue: blood
- Ancestry: White European
- Sex: males and females
- Smoking: never, former, current
- Age: yes
- Array batch: yes
- Cells: yes

Input: ExpressionSet with matrix of beta values + covariates dataframe (exposure, covariates, cells)

Output (for current and former):
- results dataframes (unadjusted, adjusted for covariates, adjusted for covariates and SVA)
- report (descriptive, QQ plot and lambda, Manhattan plot, Box plots)
- Volcano plot and Manhattan plot


## **2. Load the Data** <a name="load-data"></a>

Below are the **lines of code** necessary to **load** the data into the R environment. For this practical session, we will use public data from xxxxxxx exposome study.


The data are stored in an ExpressionSet object, a data structure that contains the beta values of the individuals at each CpG, the genomic information of the CpGs and the phenotypes of the individuals. Specific data is accessed, processed and analyzed with functions from several packages, conceived as methods acting on the ExpressionSet object.


In [None]:
input_data_url <- "https://raw.githubusercontent.com/joanall/Hands-On-Epigenome-Wide-Analysis-EWAS-/main/data/GSE42861_norm_cohort1_round3.xz.rds"

tmp <- tempfile(fileext = ".rds")
download.file(input_data_url, destfile = tmp, mode = "wb")  # binary mode
eset.cohort1 <- readRDS(tmp)

# quick sanity check: the object is valid; dimensions are probes x samples
methods::validObject(eset.cohort1); dim(Biobase::exprs(eset.cohort1))





### 2.1 ExpressionSet
An object of class ExpressionSet stores several tables: `assayData`, with the expression (here, methylation) profiles for each probe and subject; `pData`, the phenotype data with trait measurements and covariates of interest; and `fData`, the feature data with information about the probes used in the expression (or methylation) array (e.g. annotation). Specific data is retrieved with accessor functions. In particular, `exprs()` and `phenoData()` extract the data tables for subjects' expression levels and phenotypes/covariates, respectively. There are three other slots, `protocolData`, `experimentData` and `annotation` (which uses Bioconductor databases as annotation data, i.e. it does not require `fData`), that specify equipment-generated information about protocols, resulting publications and the platform on which the samples were assayed. Methods are implemented to extract the data from each slot of the object.

In [None]:
# Print a summary of the ExpressionSet (dimensions, sample names, phenotype variables)
eset.cohort1

Let us see how it works. The function exprs() extracts the epigenetic data in a matrix where subjects are columns and probes are rows.

In [None]:
# get epigenetic data (probes in rows, subjects in columns)
expr <- exprs(eset.cohort1)
dim(expr)
expr[1:10, 1:5]

Phenotypes and/or covariates can be accessed using the phenoData() function; pData() then returns them as a data frame.



In [None]:
pheno <- pData(phenoData(eset.cohort1))
head(pheno)

Use pData() function from Biobase R package to extract cohort 1 phenodata and exprs() function from Biobase R package to extract the cohort 1 methylation matrix

In [None]:
pheno.cohort1<-pData(eset.cohort1)
dim(pheno.cohort1)
methyl.cohort1<-exprs(eset.cohort1)
dim(methyl.cohort1)

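# 3 Quick QC

The genomic inflation factor (lambda) compares the median of the observed test statistics with the median expected under the null hypothesis; values well above 1 suggest inflation. The two cells below use the p-values of the results table `res.current`, which is created in section 3.3, so run them once the EWAS has finished.

In [None]:
# lambda: median observed chi-square statistic divided by the expected median
qchisq(median(res.current$p.value, na.rm = TRUE), df = 1,
                        lower.tail = FALSE)/qchisq(0.5, 1)

In [None]:
# QQ plot of the EWAS p-values for cohort 1
pvals.cohort1 <- res.current$p.value
qqman::qq(pvals.cohort1, main = "QQPlot EWAS cohort 1 smoking Never VS Current")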
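
As a self-contained illustration of the inflation factor, the sketch below computes lambda on simulated p-values rather than on the EWAS results. The helper name `inflation_lambda` and the simulated data are assumptions made for this example; only base R is used.

```r
# Genomic inflation factor (lambda) from a vector of p-values.
# `inflation_lambda` is a hypothetical helper, not part of meffil or qqman.
inflation_lambda <- function(p) {
  p <- p[!is.na(p)]
  # chi-square statistic (1 df) that corresponds to the median p-value
  observed <- qchisq(median(p), df = 1, lower.tail = FALSE)
  # median of a chi-square distribution with 1 df (expected under the null)
  expected <- qchisq(0.5, df = 1)
  observed / expected
}

set.seed(1)
p_null <- runif(10000)                    # simulated p-values under the null
lambda_null <- inflation_lambda(p_null)   # should be close to 1

# A vector whose median p-value is exactly 0.5 gives lambda = 1
lambda_exact <- inflation_lambda(c(0.1, 0.5, 0.9))

print(lambda_null)
print(lambda_exact)
```

With real EWAS results, the same function could be applied to the `p.value` column of each results table to compare the unadjusted, adjusted and SVA-adjusted models.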

# 3 Run EWAS

## 3.1. Data preparation

Here we will show an example to run the analysis testing never smokers against current smokers.
The first step is to subset the ExpressionSet with only the samples that we need. In this case, we only keep the never smokers and the current smokers, getting rid of the former smokers.


In [None]:
table(pheno.cohort1$Smoking)

In [None]:
current <- pheno.cohort1[pheno.cohort1$Smoking %in% c('never','current'),]
dim(current)

In [None]:
eset.current <- eset.cohort1[,rownames(current)] # subset the ExpressionSet to never and current smokers

In [None]:
table(eset.current$Smoking)

Extract phenodata from the ExpressionSet with never and current smokers using pData() function from Biobase R package.

In [None]:
pheno.current<-pData(eset.current)
head(pheno.current)

Now, we need to check the exposure variable (Smoking). We need this variable as factor.

In [None]:
pheno.current$Smoking<-as.factor(pheno.current$Smoking)
class(pheno.current$Smoking)

Check the levels. We want “never” to be the reference level. To this end, you can use the relevel() function.

In [None]:
levels(pheno.current$Smoking)

In [None]:
pheno.current$Smoking <- relevel(pheno.current$Smoking, "never")
levels(pheno.current$Smoking)
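
The effect of the reference level can be seen on a toy factor. The sketch below uses made-up smoking labels, not the cohort data, and shows how `relevel()` changes which category the model coefficients are compared against.

```r
# Toy exposure vector (illustrative values only)
smoking <- factor(c("current", "never", "former", "never", "current"))

# By default, levels are sorted alphabetically, so "current" is the reference
default_levels <- levels(smoking)

# Make "never" the reference level
smoking_ref <- relevel(smoking, ref = "never")
new_levels <- levels(smoking_ref)

# In a regression, every non-reference level gets its own coefficient,
# interpreted as the difference with respect to the reference level
design_columns <- colnames(model.matrix(~ smoking_ref))

print(default_levels)
print(new_levels)
print(design_columns)
```

With "never" as the reference, a negative coefficient for "current" means lower methylation in current smokers than in never smokers.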

Finally, create an object with the exposure variable (Smoking). You will need this object to run the meffil.ewas() function.

In [None]:
variable <- pheno.current$Smoking
class(variable)

The next step is to select the covariates of interest for the EWAS. We are interested in sex, age and cell type proportions. Check the class of these covariates before running the analysis.

In [None]:
class(pheno.current$Age)
class(pheno.current$Sex)
class(pheno.current$Bcell)
class(pheno.current$CD4T)
class(pheno.current$CD8T)
class(pheno.current$Mono)
class(pheno.current$Neu)
class(pheno.current$NK)

Then, create a new data.frame with the covariates of interest. You will need this object to run the meffil.ewas() function.

In [None]:
pheno.current$Sex<-as.factor(pheno.current$Sex)
covariates <- pheno.current[,c("Age","Sex",
                               "Bcell","CD4T","CD8T","Mono","Neu","NK")]

After having the exposure and the covariates objects ready, you will need to extract the methylation matrix from the ExpressionSet using exprs() function from Biobase R package. You will need this object to run the meffil.ewas() function.

In [None]:
methyl.current<-exprs(eset.current)
methyl.current[1:5,1:5]

Check order of the samples between the pheno and the methylation matrix. If samples are not in the same order, you could incorrectly assign the values of the variables to the samples and therefore also to the methylation.
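
If the check reveals a mismatch, the methylation matrix can be reordered to follow the phenotype table. The toy objects below (`pheno`, `methyl`, sample names `S1` to `S3`) are invented for illustration and are not the cohort data.

```r
# Toy phenotype table and methylation matrix with samples in a different order
pheno <- data.frame(Age = c(30, 45, 52), row.names = c("S1", "S2", "S3"))
methyl <- matrix(c(0.1, 0.2, 0.8, 0.9, 0.5, 0.4), nrow = 2,
                 dimnames = list(c("cg_a", "cg_b"), c("S3", "S1", "S2")))

# Before reordering, the sample names do not line up
matched_before <- identical(rownames(pheno), colnames(methyl))

# match() gives, for each phenotype row, the position of that sample
# among the matrix columns; indexing with it reorders the columns
idx <- match(rownames(pheno), colnames(methyl))
stopifnot(!anyNA(idx))   # every phenotype sample must exist in the matrix
methyl <- methyl[, idx, drop = FALSE]

matched_after <- identical(rownames(pheno), colnames(methyl))
print(c(before = matched_before, after = matched_after))
```

Reordering by name rather than by position keeps each sample's methylation values attached to its own covariates.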

In [None]:
table(ifelse(rownames(pheno.current)==colnames(methyl.current),
             "Matched","--NOT MATCHED--"))

## 3.2 Run Models
To run the EWAS we will use the function meffil.ewas() from meffil R package.

We need as arguments:

* beta: Methylation matrix (the first argument in the call below).
* variable: Vector with the exposure variable.
* covariates: A dataframe with the covariates to include in the regression model.
* rlm: As we want to run a robust linear regression model, we set this argument to TRUE.
* winsorize.pct: To reduce the impact of severe outliers in the DNA methylation data, the methylation beta values are winsorized at the upper and lower ends of the distribution (Ghosh 2012). The call below sets this argument to 0.05.
* sva (TRUE by default): Applies Surrogate Variable Analysis (SVA) to the methylation levels and covariates and includes the resulting variables as covariates in the regression model to correct for technical batch effects (noise).
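
To show what winsorizing does to a set of beta values, here is a minimal base R sketch. The helper `winsorize_vec` is hypothetical, and it assumes that the percentage is split evenly between the two tails; the internal implementation in meffil may differ.

```r
# Hypothetical helper: clamp values beyond the lower and upper quantiles.
# Assumption: `pct` is split evenly between the two tails.
winsorize_vec <- function(x, pct = 0.05) {
  limits <- unname(quantile(x, probs = c(pct / 2, 1 - pct / 2), na.rm = TRUE))
  pmin(pmax(x, limits[1]), limits[2])
}

# Toy beta values for one CpG with two extreme outliers (illustrative only)
beta <- c(0.01, 0.48, 0.50, 0.51, 0.52, 0.49, 0.50, 0.53, 0.47, 0.99)

# A large pct is used here only because the toy vector is very short
beta_w <- winsorize_vec(beta, pct = 0.2)

print(range(beta))     # includes the outliers
print(range(beta_w))   # outliers pulled towards the bulk of the data
```

Winsorizing keeps every sample in the analysis: extreme values are replaced by the quantile limits rather than removed.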

In [None]:
ewas.current <- meffil.ewas(methyl.current,
                            variable=variable,
                            covariates=covariates,
                            rlm=TRUE,
                            winsorize.pct=0.05)
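
Conceptually, an EWAS repeats one regression per CpG, with methylation as the outcome and the exposure plus covariates as predictors. The sketch below fits such a model for a single simulated CpG. It uses ordinary least squares (`lm()`) rather than the robust regression requested with `rlm=TRUE`, and all data and effect sizes are simulated assumptions, so it illustrates the idea rather than what meffil does internally.

```r
set.seed(42)
n <- 100

# Simulated exposure and covariate (illustrative only)
smoking <- factor(rep(c("never", "current"), each = n / 2),
                  levels = c("never", "current"))
age <- round(runif(n, 20, 70))

# Simulated beta values for one CpG:
# current smokers are 0.10 lower on average than never smokers
cpg <- 0.80 - 0.10 * (smoking == "current") + 0.0005 * age + rnorm(n, sd = 0.02)

# One regression for this CpG, adjusted for age
fit <- lm(cpg ~ smoking + age)
coef_table <- summary(fit)$coefficients

# Row "smokingcurrent" holds the estimated difference current vs never
smoking_effect <- coef_table["smokingcurrent", "Estimate"]
smoking_pvalue <- coef_table["smokingcurrent", "Pr(>|t|)"]
print(coef_table)
```

In a real EWAS this model is fitted for every CpG in the matrix, which is why the resulting p-values need correction for multiple testing.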

## 3.3 Results

In [None]:
summary(ewas.current)

We are interested in the EWAS results. The function computed the results
1) without adjusting for covariates, 2) adjusting for the covariates we specified above, and 3) adjusting for the covariates and the surrogate variables to correct for batch effects (noise).

In [None]:
summary(ewas.current$analyses)

We will use the results from the analysis adjusted for the covariates and the surrogate variables

In [None]:
res.current<-ewas.current$analyses$sva
res.current<-res.current$table

We will create a column with the Probe ID, which now is in the rows of the results dataframe

In [None]:
res.current$probeID<-rownames(res.current)

We order the results by p.value



In [None]:
res.current.ord<-res.current[order(res.current$p.value),]
head(res.current.ord)

The column names stand for:

* p.value: p-value (significance) of the association
* fdr: p-value corrected by the False Discovery Rate method
* p.holm: p-value corrected by the Holm-Bonferroni method (we are not going to use it)
* t.statistic: t statistic of the association
* coefficient: coefficient of the association (the effect or beta)
* coefficient.ci.high: upper limit of the confidence interval
* coefficient.ci.low: lower limit of the confidence interval
* coefficient.se: standard error of the coefficient
* n: sample size included in the analysis
* chromosome: chromosome in which the CpG is located in the genome (hg19)
* position: position of the CpG in the genome (hg19)

Significant hits

Our results were corrected for multiple testing using the FDR method. Significance was defined as an FDR-corrected p-value < 0.05. We tested 37,842 CpGs.
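
The link between an FDR cut-off and a raw p-value threshold can be illustrated with base R's `p.adjust()`. The p-values below are simulated and purely illustrative, and the Benjamini-Hochberg method (`method = "BH"`) is assumed here as the FDR procedure.

```r
set.seed(123)

# Simulated p-values: 950 tests without signal and 50 with signal
p_values <- c(runif(950), rbeta(50, 0.1, 20))

# Benjamini-Hochberg adjusted p-values
fdr <- p.adjust(p_values, method = "BH")

# Number of tests that pass FDR < 0.05
n_sig <- sum(fdr < 0.05)

# Largest raw p-value that still passes: the p-value threshold
# equivalent to FDR 0.05 for this particular set of tests
p_threshold <- if (n_sig > 0) max(p_values[fdr < 0.05]) else NA

print(n_sig)
print(p_threshold)
```

The equivalent raw p-value threshold depends on the whole distribution of p-values, so it has to be recomputed for each analysis.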

In [None]:
dim(res.current.ord)

Significant CpGs with a FDR p-value <0.05



In [None]:
res.current.sig <- res.current.ord[res.current.ord$fdr < 0.05, ] # keep CpGs with FDR < 0.05
head(res.current.sig)

Number of significant CpGs with a positive or negative effect



In [None]:
dim(res.current.sig[res.current.sig$coefficient <0,])

In [None]:
dim(res.current.sig[res.current.sig$coefficient >0,])

### 3.3.1 Report

To explore the results, we will create a report using meffil.ewas.parameters() and meffil.ewas.summary() functions from meffil R package.

This report contains:

* Sample characteristics
* Covariate associations: Associations between the exposure variable (Smoking) and the covariates.
* Lambdas and QQ plots to examine inflation (none, all cov, sva)
* Manhattan plots
* Significant CpG sites. As threshold we will indicate FDR=0.05.
* Specific CpGs Box plot: To observe the methylation differences between Never VS Current smokers, you can specify the CpGs that you are interested in. Today we will select our top significant CpG “cg05575921”, which is annotated to the AHRR gene and is well known for its association with tobacco.

In [None]:
ewas.parameters <- meffil.ewas.parameters(max.plots = 1,model="sva", sig.threshold = 3.86e-04) #FDR (0.05) = p.value (3.86e-04)
candidate.site <- c("cg05575921")
ewas.summary <- meffil.ewas.summary(ewas.current,
                                    methyl.current,
                                    selected.cpg.sites=candidate.site,
                                    parameters=ewas.parameters)

### 3.3.2 Plots
VOLCANO

In [None]:
res.current.ord$diffexpressed <- "NO"
res.current.ord$diffexpressed[res.current.ord$coefficient > 0
                              & res.current.ord$fdr <0.05] <- "POSITIVE"
res.current.ord$diffexpressed[res.current.ord$coefficient < 0
                              & res.current.ord$fdr <0.05] <- "NEGATIVE"

# Column names are used directly inside aes(); ggplot looks them up in `data`
p <- ggplot(data=res.current.ord, aes(x=coefficient, y=-log10(p.value), col=diffexpressed)) +
  xlim(c(-0.3,0.3))+ ylim(c(0,35)) +
  geom_point(size = 1.5) + theme_minimal() +
  labs(title = " ", x = "beta", y = "-log10(P-value)", colour = "Effect") +
  theme(axis.title = element_text(size = 14, color = "black",vjust=0.5)) +
  theme(plot.title = element_text(size = 14,face="bold",color="black",
                                  hjust= 0.5, vjust=0.5)) +
  theme(legend.title = element_text(color = "black", size = 14))


# Add a dashed line at the significance threshold
p2 <- p + geom_hline(yintercept=-log10(3.86e-04), col="red",
                     linetype = "dashed") +
  theme(axis.text = element_text(size = 14)) # threshold: FDR (0.05) = p-value 3.86e-04
mycolors<-c("#157F8D","#AF8D9B", "grey")
names(mycolors) <- c("POSITIVE", "NEGATIVE", "NO")

p3 <- p2 + scale_colour_manual(values = mycolors)

p3

MANHATTAN PLOT

In [None]:
# Create a dataframe with chr, start, end and pval
df.current<-res.current.ord[,c("chromosome","position","p.value")]
head(df.current)

In [None]:
df.current$start<-df.current$position
df.current$end<-df.current$position + 1
colnames(df.current)<-c("chr","position","p.value","start","end")
df.current<-df.current[,c("chr","start","end","p.value")]

# Create GRanges object needed
df.GRanges<-makeGRangesFromDataFrame(df.current, keep.extra.columns = TRUE)

kp <- plotKaryotype(plot.type=4, labels.plotter = NULL)
kp <- kpAddChromosomeNames(kp, cex=0.6, srt=45)
kp <- kpPlotManhattan(kp, data=df.GRanges, pval=df.current$p.value, ymax=40, genomewideline=-log10(3.86e-04)) # line at FDR (0.05) = p-value 3.86e-04
kp <- kpAxis(kp, ymin=0, ymax=40, cex=0.7)

# Questions-results
At the end of the practice, please answer these questions:

1. What is the lambda of the unadjusted EWAS of current smoking? How does it change when adding covariates and surrogate variables?
2. How many CpGs are associated with current smoking (after False Discovery Rate, FDR, correction) in the model adjusted for covariates and SVA?
3. How many of the FDR-significant CpGs show higher methylation and how many show lower methylation?
4. Which is the top CpG? On which chromosome is it located?

Answers:

1. Unadjusted: 2.44; adjusted for covariates: 2.22; adjusted for covariates + SVA: 1.51
2. 289
3. 164 show a negative effect and 125 show a positive effect
4. cg05575921, located on chromosome 5

# 4 Downstream analysis


Downstream analysis can also be done with online tools. eFORGE is used to identify cell-type or tissue-type specific signals in epigenomic data by looking at the overlap between differentially methylated positions (DMPs) and DNase I hypersensitive sites (DHSs).

The EWAS Catalog is a large database of EWAS results. We can submit the names of our top CpGs to see the latest published information about them.

In [None]:
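library(IlluminaHumanMethylation450kanno.ilmn12.hg19) # 450K array annotation
library(clusterProfiler) # GO enrichment and ID conversion
library(org.Hs.eg.db)    # human gene annotation database
library(ReactomePA)      # Reactome enrichment (enrichPathway, viewPathway)
library(enrichplot)      # enrichment plots
library(DOSE)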

## 4.1 Annotation
Once we have obtained a list of the CpGs that are significant in our analysis, we need to locate them in the genome and find out which structures surround them. Annotation consists of obtaining this information.

First we load the annotation from the IlluminaHumanMethylation450kanno.ilmn12.hg19 R package and select the columns of interest.

Another option is to go to ...... (webpage)

In [None]:
data("IlluminaHumanMethylation450kanno.ilmn12.hg19")
annotation.table<- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
dim(annotation.table)

In [None]:
annotation.table<-as.data.frame(annotation.table[,c("chr","pos","strand",
                                                    "Name","Islands_Name",
                                                    "Relation_to_Island",
                                                    "UCSC_RefGene_Name",
                                                    "UCSC_RefGene_Group")])
head(annotation.table)

Merge our significant EWAS results with the annotation.



In [None]:
# res.current.sig holds the FDR-significant CpGs (section 3.3); probeID is the CpG identifier
metaEWAS.ann<-merge(res.current.sig, annotation.table,by.x="probeID",by.y="Name")
metaEWAS.ann.ord<-metaEWAS.ann[order(metaEWAS.ann$p.value),]

head(metaEWAS.ann.ord)

Create list of Genes for Enrichment and save list of FDR CpGs and FDR genes



In [None]:
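ann.genes.current<-metaEWAS.ann.ord$UCSC_RefGene_Name
# Split multi-gene annotations ("GENE1;GENE2") and keep each non-empty gene symbol once
ann.genes.current <- unique(unlist(strsplit(ann.genes.current, ";")))
ann.genes.current <- ann.genes.current[ann.genes.current != ""]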

## 4.2. Enrichment
The idea is to compare the list of genes that overlap our CpGs with the list of all the human genes that are annotated in specific databases. With this, we can see whether our list of genes behaves like a random subset or not.
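
Over-representation tests of this kind are commonly based on the hypergeometric distribution: given the size of the background, how surprising is the overlap between our gene list and the genes annotated to a term? The counts below are hypothetical numbers chosen for illustration, not results from this analysis.

```r
# Hypothetical counts (illustrative only)
universe_size <- 20000   # genes in the background
term_size     <- 200     # background genes annotated to the term
list_size     <- 150     # genes in our list
overlap       <- 8       # genes in our list annotated to the term

# Overlap expected if our list were a random subset of the background
expected_overlap <- list_size * term_size / universe_size

# P(overlap >= observed) under the hypergeometric distribution
p_enrich <- phyper(overlap - 1, term_size, universe_size - term_size,
                   list_size, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on a 2x2 table
tab <- matrix(c(overlap,
                list_size - overlap,
                term_size - overlap,
                universe_size - term_size - list_size + overlap),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(expected_overlap)
print(p_enrich)
```

The choice of background matters: a smaller or more specific universe changes the expected overlap and therefore the p-value.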

First of all we convert Gene Symbols to Ensembl and Entrez Gene IDs to use them later

In [None]:
ids <- bitr(ann.genes.current, fromType="SYMBOL", toType=c("ENSEMBL", "ENTREZID"), OrgDb="org.Hs.eg.db")
head(ids)

### 4.2.1. GO terms
We will first work with the Gene Ontology (GO) database, that allows us to see if a specific gene function is overrepresented in our gene list. We need to obtain the list of all human genes that are curated in GO.

In [None]:
df = as.data.frame(org.Hs.egGO)
go_gene_list = unique(sort(df$gene_id))
ans.go <- enrichGO(gene = ids$ENTREZID,
                   ont = "BP",
                   OrgDb ="org.Hs.eg.db",
                   universe = go_gene_list,
                   readable=TRUE,
                   pvalueCutoff = 0.05)

tab.go <- as.data.frame(ans.go)
tab.go<- subset(tab.go, Count>5)
head(tab.go)

Finally, we can draw different types of plots to see the results graphically.



In [None]:
p1 <- barplot(ans.go, showCategory=10) +
  ggtitle('Never vs Current Smokers') +
  theme(plot.title = element_text(size = 18))
p1

In [None]:
ans.go <- pairwise_termsim(ans.go)
p2 <- emapplot(ans.go, cex_label_category = 0.5, showCategory = 20) +
  ggtitle('Never vs Current Smokers') +
  theme(plot.title = element_text(size = 18))
p2

In [None]:
p3 <- cnetplot(ans.go, circular = FALSE, colorEdge = TRUE, showCategory = 2)
p3

### 4.2.2 Reactome

In [None]:
ans.react <- enrichPathway(gene=ids$ENTREZID,
                           pvalueCutoff = 0.05,
                           readable=TRUE)
tab.react <- as.data.frame(ans.react)
head(tab.react)

In this case, there is a function that allows us to graphically investigate each of the pathways to see how the genes interact with each other. We need to prepare a named vector with the Entrez gene IDs and their effect estimates (passed as the fold change).

In [None]:
# Attach the EWAS coefficient of each annotated CpG to the converted gene IDs
ids_coef_df <- merge(ids, metaEWAS.ann.ord[,c('UCSC_RefGene_Name', 'coefficient')],
                     by.x = 'SYMBOL',
                     by.y='UCSC_RefGene_Name')

# Genes of the top enriched pathway (geneID is a "/"-separated string)
pathway_genes <- strsplit(tab.react$geneID[[1]], '/', fixed = TRUE)[[1]]

ids_coef_df <- ids_coef_df[ids_coef_df$SYMBOL %in% pathway_genes,]
ids_coef_df <- ids_coef_df[!duplicated(ids_coef_df$ENTREZID),]
ids_coef <- ids_coef_df$coefficient
names(ids_coef) <- ids_coef_df$ENTREZID


p3 <- viewPathway("Platelet activation, signaling and aggregation",
            readable = TRUE,
            foldChange = ids_coef)
p3

If the net is too busy, we can re-plot keeping just the genes on our list (the coloured ones)

In [None]:
p3$data <- p3$data[!is.na(p3$data$color),]
p3

At the end of the practice, please answer these questions:

1. Which is the top enriched GO term in current smokers? And in former smokers?
2. Which are the enriched Reactome pathways in current smokers? And in former smokers?
3. Which are the enriched tissues in current smokers? And in former smokers?

Answers:

1. For current smokers it is the regulation of neuron projection development. For former smokers it is embryonic organ development.
2. For current smokers it is Platelet activation, signaling and aggregation. For former smokers it is Neuronal System.
3. In current smokers they are blood and muscle. In former smokers they are blood and ESC.