# About the notebook
[Back to the topic](pathway_toc.ipynb)

This notebook covers steps 02 and 03. Its aim is to read the pathway gene sets of *Cryptococcus neoformans* var. *grubii* H99 into R. Because the `gage` function for gene set analysis requires the gene sets to be an R list object, we need to rearrange the gene sets and convert them from a data frame to a list.

<img src="./fig/03 pathway analysis steps.png">

----

# Set environment

In [1]:
source("Pathway_config.R")
source("Pathway_util.R")

Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,


# Read in the pathways of cne h99

The data were downloaded from [FungiDB](http://fungidb.org) using [this search strategy](http://fungidb.org/fungidb/im.do?s=b7b27b78af2797b1).

In [2]:
### read in the data
tmp <- read_tsv(file.path(INFODIR, "pathway_cne_h99_fromFungidb.txt"), col_names = TRUE)
tmp <- tmp %>% dplyr::select(-X5)

### change the column name
colnames(tmp) <- c("id", "name", "gene", "source")
dat_pathway_cne_h99 <- tmp

### show the first two rows
head(dat_pathway_cne_h99, 2)

“Missing column names filled in: 'X5' [5]”
Parsed with column specification:
cols(
  `[Pathway Id]` = col_character(),
  `[Pathway]` = col_character(),
  `[Genes]` = col_character(),
  `[Pathway Source]` = col_character(),
  X5 = col_character()
)


id,name,gene,source
ec00010,Glycolysis / Gluconeogenesis,CNAG_00038 | CNAG_00057 | CNAG_00515 | CNAG_00735 | CNAG_00797 | CNAG_01078 | CNAG_01120 | CNAG_01675 | CNAG_01820 | CNAG_01955 | CNAG_02035 | CNAG_02377 | CNAG_02489 | CNAG_02736 | CNAG_02903 | CNAG_03072 | CNAG_03358 | CNAG_03916 | CNAG_04217 | CNAG_04523 | CNAG_04659 | CNAG_04676 | CNAG_05059 | CNAG_05113 | CNAG_06035 | CNAG_06313 | CNAG_06628 | CNAG_06699 | CNAG_06770 | CNAG_07004 | CNAG_07316 | CNAG_07559 | CNAG_07660 | CNAG_07745,KEGG
ec00020,Citrate cycle (TCA cycle),CNAG_00061 | CNAG_00747 | CNAG_01120 | CNAG_01264 | CNAG_01657 | CNAG_01680 | CNAG_02736 | CNAG_03225 | CNAG_03226 | CNAG_03266 | CNAG_03375 | CNAG_03596 | CNAG_03674 | CNAG_03920 | CNAG_04189 | CNAG_04217 | CNAG_04468 | CNAG_04535 | CNAG_04640 | CNAG_05059 | CNAG_05236 | CNAG_05907 | CNAG_07004 | CNAG_07356 | CNAG_07363 | CNAG_07660 | CNAG_07851 | CNAG_07944,KEGG
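
The `X5` column dropped in the cell above has no header in the file, so readr fills in a name for it. As a minimal sketch in base R, assuming such a column is entirely empty (the output above does not confirm this), the toy data frame `toy` below (hypothetical values) shows how all-`NA` columns could be dropped without hard-coding the name:

```r
# Toy table standing in for the parsed file; values are made up for illustration.
toy <- data.frame(
  id   = c("ec00010", "ec00020"),
  gene = c("CNAG_00038 | CNAG_00057", "CNAG_00061 | CNAG_00747"),
  X5   = NA_character_,
  stringsAsFactors = FALSE
)

# Keep only columns that contain at least one non-missing value.
keep <- colSums(!is.na(toy)) > 0
toy_clean <- toy[, keep, drop = FALSE]

colnames(toy_clean)
```

This is only safer than `select(-X5)` if the unnamed column really is empty; otherwise dropping it by name, as the notebook does, is the more explicit choice.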


Now we need to convert it to a named list, with the pathway labels (ID and name) as names and character vectors of gene IDs as elements.

In [3]:
### arrange the data frame
dat <- dat_pathway_cne_h99 
dat <- dat %>% dplyr::select(-source)              # we don't need the column source
dat <- dat %>% unite(label, id, name, sep = " | ") # combine the column id and name
head(dat, 2)

label,gene
ec00010 | Glycolysis / Gluconeogenesis,CNAG_00038 | CNAG_00057 | CNAG_00515 | CNAG_00735 | CNAG_00797 | CNAG_01078 | CNAG_01120 | CNAG_01675 | CNAG_01820 | CNAG_01955 | CNAG_02035 | CNAG_02377 | CNAG_02489 | CNAG_02736 | CNAG_02903 | CNAG_03072 | CNAG_03358 | CNAG_03916 | CNAG_04217 | CNAG_04523 | CNAG_04659 | CNAG_04676 | CNAG_05059 | CNAG_05113 | CNAG_06035 | CNAG_06313 | CNAG_06628 | CNAG_06699 | CNAG_06770 | CNAG_07004 | CNAG_07316 | CNAG_07559 | CNAG_07660 | CNAG_07745
ec00020 | Citrate cycle (TCA cycle),CNAG_00061 | CNAG_00747 | CNAG_01120 | CNAG_01264 | CNAG_01657 | CNAG_01680 | CNAG_02736 | CNAG_03225 | CNAG_03226 | CNAG_03266 | CNAG_03375 | CNAG_03596 | CNAG_03674 | CNAG_03920 | CNAG_04189 | CNAG_04217 | CNAG_04468 | CNAG_04535 | CNAG_04640 | CNAG_05059 | CNAG_05236 | CNAG_05907 | CNAG_07004 | CNAG_07356 | CNAG_07363 | CNAG_07660 | CNAG_07851 | CNAG_07944


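Use `str_split` to [split by the vertical bar "|"](https://stackoverflow.com/questions/23193219/strsplit-with-vertical-bar-pipe). The bar must be escaped as `"\\|"` because the pattern is interpreted as a regular expression.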

In [4]:
### use str_split to split
### split by vertical bar "|"
str_split(dat$gene[1:2], "\\|")
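
The double backslash is needed because `str_split` treats its pattern as a regular expression, in which an unescaped `|` means alternation rather than a literal bar. The sketch below uses base R `strsplit` (a substitution for `str_split`, chosen so that it runs without stringr) on a made-up string to show two equivalent ways of matching a literal bar:

```r
x <- "CNAG_00038 | CNAG_00057 | CNAG_00515"  # made-up example string

# Option 1: escape the bar so the regex engine treats it literally.
by_regex <- strsplit(x, "\\|")[[1]]

# Option 2: turn off regex interpretation altogether.
by_fixed <- strsplit(x, "|", fixed = TRUE)[[1]]

# Both leave padding spaces behind, hence the later trimws() step.
genes <- trimws(by_fixed)
genes
```

With stringr, wrapping the pattern in `fixed("|")` has the same effect as the second option.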

Now apply `str_split` to all rows to create the gene set list, then trim the whitespace left around each gene ID.

In [5]:
lst <- str_split(dat$gene, "\\|")
lst <- lapply(lst, trimws)
lst[1:2]

In [6]:
### use str_split to create a list of gene vectors
lst <- str_split(dat$gene, "\\|")
lst <- lapply(lst, trimws)

### assign the pathway labels as the names of the list
names(lst) <- dat$label
genesets_cne_h99 <- lst

### print the results
head(genesets_cne_h99, 2)
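
Before the list is passed on, a few sanity checks may be worthwhile. The following is a minimal sketch on a toy list `toy_sets` (hypothetical, shaped like `genesets_cne_h99`); the ID pattern of `CNAG_` plus five digits is an assumption based on the rows shown above, not a documented rule:

```r
# Toy gene sets with the same shape as genesets_cne_h99 (values abbreviated).
toy_sets <- list(
  "ec00010 | Glycolysis / Gluconeogenesis" = c("CNAG_00038", "CNAG_00057"),
  "ec00020 | Citrate cycle (TCA cycle)"    = c("CNAG_00061", "CNAG_00747")
)

# Every element should be a non-empty character vector.
all_char  <- all(vapply(toy_sets, is.character, logical(1)))
non_empty <- all(lengths(toy_sets) > 0)

# Names should be present and unique so results can be traced to a pathway.
names_ok <- !is.null(names(toy_sets)) && anyDuplicated(names(toy_sets)) == 0

# Assumed ID format: "CNAG_" followed by five digits, with no stray spaces.
ids_ok <- all(grepl("^CNAG_[0-9]{5}$", unlist(toy_sets)))

stopifnot(all_char, non_empty, names_ok, ids_ok)
```

The ID check would also catch gene IDs that still carry padding spaces, for example if the `trimws` step were skipped.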

# Store the results

In [7]:
outfile <- file.path(OUTDIR, "genesets_cne_h99.RData")
save(genesets_cne_h99, file = outfile)
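
To confirm the round trip, the saved object can be loaded back. The following is a minimal sketch that uses a temporary file and a toy list in place of `OUTDIR` and the real gene sets. `load()` restores the object under its original name, whereas `saveRDS()`/`readRDS()` (an alternative not used in this notebook) let the reader choose the name:

```r
# Toy object and temporary path standing in for the real ones.
genesets_toy <- list(
  "ec00010 | Glycolysis / Gluconeogenesis" = c("CNAG_00038", "CNAG_00057")
)
outfile_toy <- tempfile(fileext = ".RData")
save(genesets_toy, file = outfile_toy)

# load() recreates `genesets_toy` in the given environment
# and returns the names of the objects it restored.
env <- new.env()
restored <- load(outfile_toy, envir = env)

# Alternative: saveRDS()/readRDS() store a single object without its name.
rds_toy <- tempfile(fileext = ".rds")
saveRDS(genesets_toy, rds_toy)
same_sets <- readRDS(rds_toy)
```

Loading into a separate environment, as here, avoids silently overwriting an object of the same name in the workspace.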