Matthew Davis 2024-11-25
- Introduction
- Installation
- Functions
- Usage
- Arguments
- Tutorials
- Legacy Functions
- Getting Help
- Contribution
- License
ggenomics is an R package that provides data visualizations using
ggplot2. It offers functions to dynamically plot genomes for exploratory
data analysis. ggenomics aims to utilize ggplot syntax to provide
base-level genomic plots that can later be customized by the user.
ggenomics was designed to simplify genomic data visualization using
ggplot2. ggenomics focuses on:
- Seamless integration with the ggplot2 ecosystem.
- Dynamic plotting for exploratory analysis.
- Support for large genomic datasets.
To install required dependencies, you can use the following code:
install.packages(c("data.table", "tidyverse", "scales", "pbapply"))
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("Biostrings")
You can install ggenomics from GitHub using the following command:
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("matthewwdavis/ggenomics")
After installing ggenomics, load it into R and check the version:
library(ggenomics)
packageVersion("ggenomics")
The functions in ggenomics create specifically structured data frames
for plotting. ggenomics is expected to continually evolve, with more
functions for analysis and plotting added over time.
Below are some of the current functions in ggenomics, each with a brief
description. For more information see Usage, and for in-depth
start-to-finish examples see Tutorials.
Current ggenomics functions:
- ggread_fasta() reads in fasta files.
- telomere_plotting_table() generates data in the format required by geom_telplot().
- ggenom() initializes a ggplot2 object with ggenomics-specific mapping options.
- geom_telplot() creates a plot of chromosomes with telomeric sequences marked by size.
- create_window_fasta() creates windows from a fasta file read in with ggread_fasta() or readDNAStringSet().
- sliding_window_table() creates sliding windows from a table with columns CHROM and POS.
There are many functions that are rarely used on their own and are
instead used to facilitate other, larger functions within ggenomics.
Those functions will not receive in-depth documentation, but they are
available as separate functions for the user regardless. The code behind
these functions can be viewed in R with View(function_name).
ggenomics has two main functionalities: data analysis and plotting.
The data analysis tools are set up to be used with the plotting
functions. A typical workflow will use a specific data analysis tool to
generate a data set with specific formatting. This data set will then be
incorporated into the respective plotting functions.
With the goal of replicating ggplot2 syntax, ggenomics uses a
wrapper function, ggenom(), to read in data created by other functions.
ggenom() has ggenomics-specific mapping options for plotting with
ggenomics geoms. Plotting functions are added using ggplot2
syntax, attaching geoms to ggenom() with a +.
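As a rough illustration of the wrapper idea, the sketch below shows how a function could translate a plot keyword into a ggplot2 mapping before calling ggplot(). The name sketch_ggenom() and the Chromosome/Length mapping are assumptions made for illustration (the column names are borrowed from the table printed in Tutorials); this is not the ggenomics source code.

```r
library(ggplot2)

# Hypothetical stand-in for ggenom(): choose an aesthetic mapping from a
# keyword, then hand the data and mapping to ggplot().
sketch_ggenom <- function(data, plot = c("telplot")) {
  plot <- match.arg(plot)  # errors on an unknown keyword
  mapping <- switch(plot,
    telplot = aes(x = Chromosome, y = Length)  # assumed mapping, illustration only
  )
  ggplot(data, mapping)
}

toy <- data.frame(Chromosome = factor(1:2), Length = c(30427671, 19698289))

# Geoms attach with `+`, just as with a plain ggplot() call.
p <- sketch_ggenom(toy, plot = "telplot") + geom_col()
```

Keeping the mapping behind a keyword means the user only has to remember one value per plot type, while the result is still an ordinary ggplot object that accepts further layers.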
This section will describe the functions and cover some basic examples for each function. The Tutorials section has more in-depth start-to-finish information on usage, and further information about the arguments available in each function can be found in the Arguments section.
ggread_fasta()
- This function reads a fasta file into R. It is a wrapper for readDNAStringSet() and creates a DNAStringSet object. ggenomics functions that use fasta files will use this object, so using this function or readDNAStringSet() for fasta files is necessary.
genome <- ggread_fasta("path/to/fasta")
telomere_plotting_table()
- This function takes a fasta file and a telomere string (Default = “CCCTAAA”), then searches the fasta for occurrences of that string in windows of a specified size (Default = 1 Mb). The function searches for three back-to-back copies of the telomere string to reduce detection of k-mer occurrences that are not associated with telomeres. For example, if the string is “CCCTAAA”, the windows of the genome are searched for occurrences of “CCCTAAACCCTAAACCCTAAA”. The table is then filtered for a minimum number of string occurrences per window (Default = 25).
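The search described above can be sketched in base R on a plain character string. This is only an illustration of the idea (tripled motif, fixed windows, minimum count filter); the helper names count_matches() and count_tripled_motif() are hypothetical, and the real function works on a DNAStringSet and may differ in its details.

```r
# Count non-overlapping occurrences of a fixed pattern in one string.
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Hypothetical helper: cut `dna` into windows and count the tripled motif in each,
# keeping only windows with at least `min_count` occurrences.
count_tripled_motif <- function(dna, motif = "CCCTAAA",
                                window_size = 1e6, min_count = 25) {
  tripled <- strrep(motif, 3)  # e.g. "CCCTAAACCCTAAACCCTAAA"
  starts <- seq(1, nchar(dna), by = window_size)
  ends <- pmin(starts + window_size - 1, nchar(dna))
  windows <- substring(dna, starts, ends)
  counts <- vapply(windows, count_matches, integer(1),
                   pattern = tripled, USE.NAMES = FALSE)
  out <- data.frame(start = starts, end = ends, count = counts)
  out[out$count >= min_count, ]
}

# Toy sequence: nine repeats at the start, filler, and one stray repeat at the end.
dna <- paste0(strrep("CCCTAAA", 9), strrep("ACGT", 10), "CCCTAAA")
res <- count_tripled_motif(dna, window_size = 70, min_count = 2)
```

In the toy sequence, the single stray CCCTAAA at the end is never counted because it does not occur three times in a row, which is the point of searching for the tripled string.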
tel.table <- telomere_plotting_table(genome)
ggenom()
- This function initializes a ggplot2 object in R. It is a wrapper for ggplot() and creates a ggplot2 object with ggenomics-specific mapping specified by the plot argument. The user should specify the plot value that corresponds to the ggenomics geom being used. The proper plot argument settings are given in the geom-specific Usage information and in the Tutorials for creating specific plots. The options for plot and the corresponding geoms can be seen in Arguments.
ggenom_object <- ggenom(tel.table, plot = "telplot")
geom_telplot()
- This function generates a ggenomics telomere plot that can be used as a base for genome visualization. It should be added to a ggenom()-created object with the proper mapping information specified by the plot argument. The proper plot argument value to specify in ggenom() for geom_telplot() is plot = "telplot". This can be seen here and in Tutorials. Since the plot is ggplot2 based, the user can customize the result however they like with ggplot2, a concept which is further explored in Tutorials.
ggenom(tel.table, plot = "telplot") +
  geom_telplot()
# --or-- #
ggenom_object +
  geom_telplot()
create_window_fasta()
- This function creates windows (Default = 1 Mb) from a fasta file read with ggread_fasta() or readDNAStringSet(). It extracts sub-strings at regularly defined intervals from the fasta file.
fasta_windows <- create_window_fasta(genome)
sliding_window_table()
- This function creates windows (Default = 10 kb) with a slide (Default = 5 kb) from data.frames and data.tables with columns named CHROM and POS. It will create 3 new columns (WINDOW_START, WINDOW_END, POS_WINDOW) and append them to the current data. WINDOW_START is the base pair position of the start of the window, WINDOW_END is the base pair position of the end of the window, and POS_WINDOW is the base pair position of the midpoint of the window. If the user does not want the windows to slide, set the slide_size argument equal to the window_size argument. If a column named SOURCE is present, the function will automatically take that into account when creating windows.
table_windows <- sliding_window_table(vcf_table)
ggread_fasta()
- path_to_fasta: The path to the fasta file of interest to read into R. Creates a DNAStringSet object.
telomere_plotting_table()
- genome: DNAStringSet object of a fasta file. Can be generated with ggread_fasta() or readDNAStringSet().
- chr_names: A character string indicating the prefix designating chromosome names. This is a crucial argument. Default is “Chr”. If the chromosomes begin with a number, use “^\\d”.
- string_remove: A character string to remove from chromosome names. Default is “_RagTag”. If the user does not want to remove strings other than the default, this does not need to be changed.
- tel_start_seq: A character string representing the telomere sequence. Default is the Arabidopsis telomere repeat, “CCCTAAA”.
- tel_end_seq: A character string representing the reverse complement of the telomere sequence. Default is the Arabidopsis telomere repeat, “TTTAGGG”.
- size_windows: A numeric value specifying the size of the window to search for telomeric sequence within. Default is 1000000 (1 Mb).
- min_tel_count: A numeric value specifying the minimum telomere repeat count per window to include in the final table. Default is 25.
- sample_name: A character string to include the sample name in the table. Default is NULL.
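To make the chr_names, string_remove, and tel_end_seq arguments more concrete, here is a small base R sketch. It assumes that chr_names behaves like a regular expression matched against sequence names and that string_remove is simply deleted from them; the helper revcomp() and the example sequence names are hypothetical, and the package internals may work differently.

```r
# Example sequence names as they might appear in two different assemblies.
ragtag_names <- c("Chr1_RagTag", "Chr2_RagTag", "ChrM_RagTag", "scaffold_12")
ensembl_names <- c("1", "2", "3", "4", "5", "Mt", "Pt")

# chr_names = "Chr": keep names with that prefix, then strip string_remove.
# Note that a bare prefix also keeps "ChrM".
kept_ragtag <- ragtag_names[grepl("^Chr", ragtag_names)]
clean <- sub("_RagTag", "", kept_ragtag, fixed = TRUE)

# chr_names = "^\\d": keep names that begin with a digit.
# The backslash must be doubled inside an R string.
kept_ensembl <- ensembl_names[grepl("^\\d", ensembl_names)]

# Hypothetical helper: reverse complement of a DNA string, showing how the
# default tel_end_seq relates to the default tel_start_seq.
revcomp <- function(x) {
  comp <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}
end_seq <- revcomp("CCCTAAA")
```

If a custom tel_start_seq is supplied, tel_end_seq should presumably be changed to its reverse complement as well, since both defaults describe the same repeat read from opposite strands.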
ggenom()
- data: Data needed for plotting. This is generated by other functions, such as telomere_plotting_table().
- mapping: The column headers for plotting. If using the plot argument, this can be ignored. Using plot is suggested.
- plot: The necessary mapping information for each ggenomics-style plot. This is different for each ggenomics geom. This is the suggested usage over mapping. Options include:
  - “telplot” to be used with geom_telplot()
geom_telplot()
- chr_color: A character string specifying the color of the plotted chromosomes or sequences. Default is “#F8766D”.
- chr_size: A numeric value specifying the width of the plotted chromosomes or sequences. Default is 6.
- tel_color: A character string specifying the color of the plotted telomeres. Default is “black”.
- tel_shape: A numeric value specifying the shape of the plotted telomeres. Default is 16.
- legend_title: A character string specifying the title of the legend. Default is “Telomere Size (bp)”.
- text_size: A numeric value specifying the base size for plot text like axis labels and legends. Default is 6.
- plot_title: A character string specifying the title of the plot. Default is NULL.
- x_axis_title: A character string specifying the title of the x-axis. Default is NULL.
- y_axis_title: A character string specifying the title of the y-axis. Default is “Chromosome Length”.
create_window_fasta()
- genome: DNAStringSet object of a fasta file. Can be generated with ggread_fasta() or readDNAStringSet().
- window_size: A numeric value specifying the size of the window. Default is 1000000 (1 Mb).
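The windowing idea can be sketched in base R on a named character vector instead of a DNAStringSet. The helper make_windows() and its output columns are hypothetical and only illustrate extracting sub-strings at regular intervals; the last window of each sequence is simply shorter when the length is not a multiple of window_size.

```r
# Hypothetical helper: cut each named sequence into consecutive windows.
make_windows <- function(seqs, window_size = 1e6) {
  pieces <- lapply(names(seqs), function(nm) {
    len <- nchar(seqs[[nm]])
    starts <- seq(1, len, by = window_size)
    ends <- pmin(starts + window_size - 1, len)  # clip the final window
    data.frame(name = nm, start = starts, end = ends,
               sequence = substring(seqs[[nm]], starts, ends),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

toy <- c(chrA = "ACGTACGTAC", chrB = "GGGCC")
win <- make_windows(toy, window_size = 4)
```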
sliding_window_table()
- mut_table: A data.frame or data.table containing genomic data. Must have columns CHROM and POS.
- window_size: Numeric value specifying the size of the window. Default is 10000 (10 kb).
- slide_size: Numeric value specifying the step size for the sliding window. If the user does not want the windows to slide, the step size should equal the window size. Default is 5000 (5 kb).
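The relationship between window_size and slide_size can be illustrated with a base R sketch. It assumes 1-based windows starting at position 1 of each chromosome and repeats a row once for every window that contains its POS; the helper sliding_windows_sketch() is hypothetical, ignores the SOURCE column, and may not match how the package assigns windows.

```r
# Hypothetical helper: attach WINDOW_START, WINDOW_END and POS_WINDOW to each
# row, once per window that contains the row's POS.
sliding_windows_sketch <- function(tab, window_size = 10000, slide_size = 5000) {
  per_chrom <- lapply(split(tab, tab$CHROM), function(chr) {
    starts <- seq(1, max(chr$POS), by = slide_size)
    rows <- lapply(starts, function(s) {
      e <- s + window_size - 1
      hit <- chr[chr$POS >= s & chr$POS <= e, , drop = FALSE]
      if (nrow(hit) == 0) return(NULL)  # skip empty windows
      hit$WINDOW_START <- s
      hit$WINDOW_END <- e
      hit$POS_WINDOW <- (s + e) / 2     # midpoint of the window
      hit
    })
    do.call(rbind, rows)
  })
  out <- do.call(rbind, per_chrom)
  rownames(out) <- NULL
  out
}

tab <- data.frame(CHROM = c("1", "1", "2"), POS = c(3, 12, 7))
overlapping <- sliding_windows_sketch(tab, window_size = 10, slide_size = 5)
no_slide <- sliding_windows_sketch(tab, window_size = 10, slide_size = 10)
```

With slide_size smaller than window_size, a position can fall into more than one window, so the output has more rows than the input; with the two sizes equal, each position lands in exactly one window.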
The following examples are more in-depth than what is found in Usage and meant to walk the user through using the package from start-to-finish, data download to plotting. The sub-headers define different end goals. Publicly available data is used so that the users’ results can be compared here to make sure everything is operating correctly.
Downloading an example fasta file (Arabidopsis TAIR10):
download.file("https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz", destfile = "./arabidopsis_tair10.fasta.gz", mode = "wb")
The first step is to load the library:
library(ggenomics)
After loading the library, read in the fasta file with ggread_fasta().
Read in the example fasta to use for ggenomics:
- This creates a DNAStringSet object of a fasta file of interest for downstream analysis.
genome <- ggread_fasta("./arabidopsis_tair10.fasta.gz")
Next, the user should use a data analysis function:
- In this example, the function creates a table with telomere counts.
telo.table <- telomere_plotting_table(genome, chr_names = "^\\d")
# "^\\d" is used here to specify that the chromosome names begin with a number, as we are not interested in plotting the organellar (mitochondrial and plastid) genomes.
print(telo.table)
## Chromosome Length Forward_Counts Reverse_Counts begin_telo_bp end_telo_bp
## <fctr> <int> <int> <int> <num> <num>
## 1: 1 30427671 270 1 5670 21
## 2: 2 19698289 0 49 0 1029
## 3: 3 23459830 37 0 777 0
## 4: 4 18585056 52 0 1092 0
## 5: 5 26975502 NA NA NA NA
## begin_telo_start begin_telo_end end_telo_start end_telo_end total_telo_bp
## <num> <num> <num> <int> <num>
## 1: 0 5670 30427650 30427671 5691
## 2: 0 0 19697260 19698289 1029
## 3: 0 777 23459830 23459830 777
## 4: 0 1092 18585056 18585056 1092
## 5: 0 NA NA 26975502 NA
## normalized_total_telo_size
## <num>
## 1: 4.776479e-05
## 2: 8.636438e-06
## 3: 6.521392e-06
## 4: 9.165199e-06
## 5: NA
NOTE: It is always a good idea to inspect the table and ensure you are seeing what is expected. In this case, Chromosome 5 had no detected telomeric repeat, and so it has NA values.
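One way to do this inspection programmatically is sketched below. The data frame is re-typed by hand from a few columns of the output printed above, so it does not require the package. The factor of 21 is only an observation about these printed values (it matches the length of the tripled 7-base repeat) and is not documented behavior.

```r
# A few columns of the printed table, re-typed by hand for illustration.
telo <- data.frame(
  Chromosome = factor(1:5),
  Length = c(30427671, 19698289, 23459830, 18585056, 26975502),
  Forward_Counts = c(270, 0, 37, 52, NA),
  Reverse_Counts = c(1, 49, 0, 0, NA),
  begin_telo_bp = c(5670, 0, 777, 1092, NA),
  end_telo_bp = c(21, 1029, 0, 0, NA)
)

# Chromosomes with no detected telomeric repeat show up as NA rows.
missing_chr <- as.character(telo$Chromosome[is.na(telo$Forward_Counts)])

# In this output, the bp columns equal the counts times 21.
bp_matches_counts <- all(telo$begin_telo_bp == telo$Forward_Counts * 21, na.rm = TRUE) &&
  all(telo$end_telo_bp == telo$Reverse_Counts * 21, na.rm = TRUE)
```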
Then, the user can utilize the ggenom() function paired with the
geom_telplot() function to create a telomere plot:
- If the user wants to create a telomere plot, then set plot = "telplot". The options for possible plot argument values can be found in Arguments.
ggenom(telo.table, plot = "telplot") +
  geom_telplot()
There are some arguments within geom_telplot() to specify shape and
color:
ggenom(telo.table, plot = "telplot") +
  geom_telplot(chr_color = "bisque2", tel_color = "darkgreen", tel_shape = 18)
Since all plots are ggplot2 based, they can be edited and adjusted
like any ggplot:
- The user can add adjustments with +, just like in ggplot2.
library(ggplot2) # scale_y_continuous(), labs(), theme_classic(), theme(), unit()
library(scales)  # label_number()
ggenom(telo.table, plot = "telplot") +
  geom_telplot(chr_color = "bisque2", tel_color = "darkgreen", tel_shape = 18) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "Mb")) +
  labs(y = "Sequence Length", x = "Chromosome", size = "Telomere Size", title = "ggenomics Telomere Plot") +
  theme_classic(base_size = 6) +
  theme(legend.position = "bottom",
        legend.key.size = unit(0.2, "cm"),
        plot.title = element_text(hjust = 0.5, face = "bold"))
Disclaimer: These functions may not be actively maintained, and users should use updated alternatives when possible.
These functions were originally from the ggideo package. While that
package has been archived, the functions will continue to exist in
ggenomics, albeit with little continuous upkeep.
Below is an example of how to use these legacy functions:
ggideo() is used to plot telomere plots of primary assemblies.
library(ggenomics)
# Generate data and plot, stored as a list
genome.plot <- ggideo("./arabidopsis_tair10.fasta.gz", chr_names = "^\\d")
- Print the table.
genome.plot$genomic.table
- Print the plot.
genome.plot$ideogram
ggideo_diploid() is used to plot telomere plots of haplotype-phased diploid assemblies. The haplotypes can be two separate fasta files, or a fasta file with both haplotypes present. The haplotypes should be identified with “_hap1” and “_hap2”.
- Example of the two separate fasta files.
library(ggenomics)
# Generate data and plot, stored as a list
genome.plot <- ggideo_diploid("./genome_hap1.fasta.gz", "./genome_hap2.fasta.gz")
## Joining with `by = join_by(Chromosome, Length, Forward_Counts, Reverse_Counts,
## begin_telo_bp, end_telo_bp, begin_telo_start, begin_telo_end, end_telo_start,
## end_telo_end, total_telo_bp, normalized_total_telo_size, Hap)`
- Print the table.
genome.plot$genomic.table
- Print the plot.
genome.plot$ideogram
- Example of usage with both haplotypes in one combined fasta file.
library(ggenomics)
# Generate data and plot, stored as a list
genome.plot <- ggideo_diploid(combined_hap_fasta = "./genome_combohap.fasta.gz",
                              string_remove = "_hap\\d_RagTag")
## Joining with `by = join_by(Chromosome, Length, Hap, Forward_Counts,
## Reverse_Counts, begin_telo_bp, end_telo_bp, begin_telo_start, begin_telo_end,
## end_telo_start, end_telo_end, total_telo_bp, normalized_total_telo_size)`
- Print the table.
genome.plot$genomic.table
- Print the plot.
genome.plot$ideogram
If you encounter any issues or have questions, you can:
- Check the detailed package documentation using ?function_name in R
- Submit an issue on the GitHub repository
Contributions are welcome! To contribute:
- Fork the repository on GitHub.
- Make your changes in a new branch.
- Submit a pull request with a detailed description of your changes.
This package is licensed under the MIT License. See the LICENSE file for details.





