Analysis of Gene Expression @ University of Chemistry and Technology in Prague

gorgitko/analysis_of_gene_expression

These materials are for the course Analysis of Gene Expression taught at the University of Chemistry and Technology in Prague, and guaranteed by the Department of Informatics and Chemistry within the Bioinformatics study programme (available at the bachelor's, master's, and PhD levels).

The authors are from the Laboratory of Genomics and Bioinformatics at the Institute of Molecular Genetics of the Czech Academy of Sciences.

In case of suggestions or problems, create a new issue. We will be happy to answer your questions, integrate new ideas, or resolve any problems 😊

Lectures

Recordings and materials for the theoretical lectures are stored in the school's MS Teams and are currently available only to course participants.

Exercises

Prerequisites

We expect all participants to have a basic knowledge of base R and the Linux shell (bash). Links to relevant materials can be found in E01 - Intro.

Software prerequisites

We are using virtual machines (VMs) with images based on Debian 10 that include all the necessary software (R 4.0, RStudio Server, conda, and various tools). We are grateful to the Metacentrum Cloud team for their great assistance with the virtual machines ❤️

However, you can install everything yourself to get the same environment as our VMs offer (or one very close to it). In general, we recommend working on a Linux-based system (our tip: Linux Mint).

Getting the exercise files

Just download and unzip this repository. Additional data files for E07 - RNA-seq must be downloaded separately; see the relevant section below.
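
As an alternative to downloading a ZIP, the repository can be cloned with git. The snippet below is a minimal sketch: it assumes the repository is hosted on GitHub under the identifier gorgitko/analysis_of_gene_expression, and the variable `AGE_REPO_URL` and the helper `clone_age_repo` are hypothetical names introduced only for illustration.

```shell
# Sketch: fetch the exercise files with git instead of downloading a ZIP.
# ASSUMPTION: the repository is hosted on GitHub under the identifier
# gorgitko/analysis_of_gene_expression.
AGE_REPO_URL="https://github.com/gorgitko/analysis_of_gene_expression.git"

# Hypothetical helper: shallow-clone the repository into a target directory
# (defaults to ./analysis_of_gene_expression).
clone_age_repo() {
    target_dir="${1:-analysis_of_gene_expression}"
    git clone --depth 1 "$AGE_REPO_URL" "$target_dir"
}
```

For example, `clone_age_repo ~/age` would place the exercises in `~/age/Exercises`. A shallow clone (`--depth 1`) skips the commit history, which is usually enough for working through the exercises.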

R dependencies

You need R 4.0+ and Bioconductor 3.12+ installed. We recommend using the RStudio IDE for programming.

A lockfile for renv is included. It captures all packages needed to run the exercises. Moreover, renv installs all packages to a local, project-specific R library, so the installation doesn't pollute the system library.

To start the installation of required packages:

  1. Create a new RStudio project in the Exercises/ directory. If you are not using RStudio, just change R's working directory to Exercises/.
  2. Start R.
  3. Run renv::init(). This will create a new project-specific library and install packages from renv.lock. If renv is not available, install it first with install.packages("renv").

Other tools

Other tools can be installed through your OS package manager or with conda (see E01 - Intro). The latter is recommended for bioinformatics tools, which are mainly used during the RNA-seq exercises.

Solutions of assignments

For educational reasons, the solutions are kept in a private repository and are available upon request.


Overview of exercises

E01 - Intro (Rmd) - Jiri Novotny

  • Some information about our virtual machines and files.
  • sshfs - mount directory on a remote server
  • tmux - terminal multiplexer
  • fish - a friendly, interactive shell
  • conda - package and virtual environment manager
  • Links to beginner base R tutorials and other useful stuff.

E02 - Intro to R (Rmd) - Jiri Novotny

  • Introduction to RMarkdown (Rmd).
  • Reproducible R (project-oriented workflow, consistent paths using here(), namespace conflicts, renv, etc.).
  • Installing R packages.
  • Debugging R.
  • Writing your own functions.
  • Vectorized operations, avoiding for loops, parallelization.
  • Introduction to the tidyverse
    • Overview of tidy data and non-standard/tidy evaluation.
    • magrittr - pipe operator
    • tibble - enhanced data.frame
    • dplyr - data manipulation
    • tidyr - tools for tidy data
    • stringr - consistent wrappers for common string operations
    • glue - string interpolation
    • purrr - functional programming tools
    • ggplot2
      • Basic philosophy and usage.
      • Libraries extending ggplot2.
      • Additional themes.
  • Other useful libraries
    • janitor - table summaries
    • plotly - interactive HTML plots
    • heatmaply - interactive HTML heatmaps
    • pheatmap - pretty heatmaps in base R
    • ComplexHeatmap - introduction (Rmd)
    • BiocParallel - parallelized lapply() and others

E03 - qPCR (Rmd) - Jiri Novotny

  • The main purpose of this exercise is to practice basic R on a small dataset and to implement a basic set of (mainly visualization) functions, which will be used later for microarray and RNA-seq data.
  • Implemented functions are located in age_library.R, skeletons are in age_library_empty.R.

E04 - microarrays (Rmd) - Jiri Novotny

  • Exercise on Affymetrix microarray analysis.
  • Reading in data, technical and biological quality control, normalization, differential expression, reporting.

E05 - multiple testing issue (Rmd) - Michal Kolar

  • Demonstration of the multiple testing problem and its correction methods on fair/skewed coins.

E06 - IGV browser - Michal Kolar

  • Files for practising IGV usage.

E07 - RNA-seq - Jiri Novotny

Additional data files must be downloaded beforehand from here. If you are working on a remote server, you can use wget for downloading: wget https://onco.img.cas.cz/novotnyj/age/AGE2021_data.tar. Then extract the downloaded archive to the Exercises/ directory, e.g. tar xf AGE2021_data.tar -C /path/to/Exercises.

(These data also include the output of this exercise, which is why they are so large. TODO: also provide only the data needed to begin this exercise: reference FASTAs and GTF, sample FASTQs, etc.)
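
The download and extraction steps can be sketched as a pair of small shell helpers. This is only an illustration under stated assumptions: the URL is the one given above, while the names `AGE_DATA_URL`, `download_age_data`, and `extract_age_data` are hypothetical. The sketch uses `tar xf` without a compression flag, so that tar detects by itself whether the archive is compressed.

```shell
# Sketch: download the E07 data archive and extract it into Exercises/.
# The URL comes from the text above; all function names are hypothetical.
AGE_DATA_URL="https://onco.img.cas.cz/novotnyj/age/AGE2021_data.tar"

# Download the archive into the current directory.
# -c resumes a partially downloaded file, which helps with large archives.
download_age_data() {
    wget -c "$AGE_DATA_URL"
}

# Extract an archive ($1) into a target directory ($2),
# creating the directory first if it does not exist.
extract_age_data() {
    archive="$1"
    target_dir="$2"
    mkdir -p "$target_dir" && tar xf "$archive" -C "$target_dir"
}
```

A typical session would run `download_age_data` and then `extract_age_data AGE2021_data.tar /path/to/Exercises`. Because the archive is large, running both inside tmux on a remote server keeps the transfer alive if the SSH connection drops.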

01 - technical quality control and trimming (Rmd)

  • Downloading from SRA (fasterq-dump).
  • Technical quality control (FastQC, MultiQC).
  • Read trimming (Trimmomatic).

02 - quantification (Rmd)

  • Downloading reference files (genome, annotation, etc.).
  • Filtering out rRNA and tRNA (SortMeRNA).
  • Two quantification pipelines:
    • Aligning to genome (GSNAP), quality control of the alignment (RSeQC, preseq) and counting overlaps (featureCounts).
    • Mapping to transcriptome (Salmon).
  • Importing count matrix to R (tximport, DESeq2).
  • Using DESeqDataSet.

03 - exploratory analysis (Rmd)

  • Running DESeq2.
  • Gene annotation.
  • Count transformations, TPM calculation.
  • PCA, hierarchical clustering, boxplots.

04 - differential expression (Rmd)

  • Using DESeq2 - contrasts, interactions, independent filtering, LFC shrinkage.
  • Reporting results: MA plot, volcano plot, boxplots, ReportingTools.

05 - Gene Set Enrichment Analysis (ORA, GSEA, SPIA) (Rmd)

  • Gene set databases.
  • Data preparation.
  • ORA (goseq).
  • GSEA by Subramanian (clusterProfiler) + visualization.
  • Signaling pathway impact analysis (SPIA).
  • Viewing data in KEGG (pathview).
  • Online tools.

E08 - single-cell RNA-seq (Rmd) - Jiri Novotny

  • Introduction, software overview, and links to tutorials, lists and other readings.
