These introductory R sessions aim to teach participants without programming experience the basics of the R statistical programming language for reproducible data analysis. R is a freely available programming environment that is well-suited for common activities in data analysis including complex data manipulation, statistical analysis, automation, and publication-quality data visualization. We will introduce basic concepts of R programming as well as more generalizable best practices in working with laboratory data.
The goals for this course are for participants to be comfortable performing the following basic tasks on a data set:
- Import a csv, text, or Excel file
- Create a relevant subset of data for analysis
- Add columns to the data set based on calculations or manipulation of existing columns
- Calculate summary statistics for the entire data set and/or important subgroups of data
- Perform basic statistical tests on a data set
- Develop simple visualizations that summarize data, such as bar plots and scatterplots
- Produce a reproducible report so others can clearly understand and repeat an analysis
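As a rough preview, most of the tasks above can be sketched in a few lines of R. This is a minimal sketch, not course material: the data values and the column names (`sample_id`, `group`, `result_mg_dl`) are made up for illustration, and it uses only base R so that it runs without extra packages, whereas the course sessions may use tidyverse functions for the same steps. The reproducible report is not shown here.

```r
# Made-up example data (hypothetical values, not from the course).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sample_id,group,result_mg_dl",
  "S1,control,90",
  "S2,control,100",
  "S3,control,95",
  "S4,patient,140",
  "S5,patient,150",
  "S6,patient,160"
), csv_path)

# Import a csv file into a data frame.
labs <- read.csv(csv_path)

# Add a column calculated from an existing column (mg/dL to g/L).
labs$result_g_l <- labs$result_mg_dl / 100

# Create a relevant subset of the data.
patients <- labs[labs$group == "patient", ]

# Summary statistics for the whole data set and for each subgroup.
overall_mean <- mean(labs$result_mg_dl)
group_means <- aggregate(result_mg_dl ~ group, data = labs, FUN = mean)

# A basic statistical test: Welch two-sample t-test between the groups.
test_result <- t.test(result_mg_dl ~ group, data = labs)

# A simple visualization: bar plot of the group means.
barplot(group_means$result_mg_dl,
        names.arg = group_means$group,
        ylab = "Mean result (mg/dL)")
```

Printing `group_means` or `test_result` at the console shows the subgroup summaries and the test output.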
Instructors:
- Patrick Mathias
- Niklas Krumm
- Andrew Laitman
Prerequisites:
- A computer (macOS, Windows, or Linux) with Internet access. If you are attending in person in NW150, you will need a laptop to complete the exercises. The course will be taught simultaneously over Zoom, so you can also participate from a desktop or workstation.
- Please complete the following survey so we can better understand your programming and R experience and what you want out of the course: Pre Course Survey
- The program that we will be using to interact with R during the course is called RStudio. We will be using a cloud-based RStudio server, hosted at RStudio.Cloud, in our workshop. Our intent in using a cloud-based platform is to minimize possible in-class setup issues.
- Please follow the instructions in this presentation to set up an RStudio.Cloud account. Click the Download button to download the presentation to your computer and open it.
- Link to RStudio.Cloud workspace
- Note: Some older internet browsers may not be compatible with RStudio.Cloud. See this web page for additional information.
- (Optional) Install RStudio Desktop on the computer you will be using for work as well. See instructions below. Our Help Desk staff can help if you don't already have the software installed on a UW-issued workstation or laptop.
Working with our cloud-based RStudio instance will be the most straightforward way to proceed through the sessions. However, in the long term, you may need R and RStudio installed on your own computer to work with your own data. You can find a video with step-by-step instructions for installing on Mac or PC by following the links below:
Please complete each step in the video in turn including the final step, installing the tidyverse packages. Note that if you are installing RStudio and doing data analysis on your own computer, you are responsible for ensuring the security of the data that you analyze on your system. This includes ensuring access control and hard drive encryption if you are handling protected health information, among other considerations. Please consult with the Laboratory Medicine and Pathology Informatics faculty and staff if you are unsure about the suitability of a computer system for use with sensitive information.
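For reference, the final installation step can also be run from the R console. This is a minimal sketch rather than the exact commands shown in the video: it assumes an Internet connection, and the `repos` URL is the general CRAN cloud mirror, supplied here as an assumption so the call also works outside RStudio.

```r
# Install the tidyverse packages only if they are not already present.
# The repos URL is an assumption (the CRAN cloud mirror); RStudio usually
# has a default mirror configured already.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load the core tidyverse packages to confirm the installation worked.
library(tidyverse)
```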
There are multiple ways to access and interact with the content, depending on whether you choose to proceed through the workshop using the cloud-based RStudio instance or RStudio on your own laptop.
- If you choose to use the cloud-based RStudio instance, all the course content will be pre-loaded.
- If you would prefer to run RStudio on your own computer:
  - Download the course material from the course GitHub repository as a .zip file from here
  - Unzip the file to a convenient location (your desktop or documents folder)
  - When you open RStudio, set this location as your working directory
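The last step above can be done from the R console. The following is a minimal sketch: the folder name `course-material` and its location on the desktop are hypothetical, so replace the path with wherever you unzipped the files.

```r
# Hypothetical location of the unzipped course folder; replace with your own.
course_dir <- file.path(path.expand("~"), "Desktop", "course-material")

# Change the working directory only if the folder actually exists.
if (dir.exists(course_dir)) {
  setwd(course_dir)
} else {
  message("Folder not found: ", course_dir)
}

# Confirm the current working directory and list the files in it.
getwd()
list.files()
```

RStudio also offers the same action through its menus, which may be more convenient if you prefer not to type paths.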
The content for this course was originally developed for the 2019 Pathology Informatics Summit workshop. The original contributors and the content each was responsible for are as follows:
- Joe Rudolf: Introduction to R & RStudio
- Patrick Mathias: Reproducible Reporting
- Amrom Obstfeld: Data Transformation & Exploratory Data Analysis
- Stephan Kadauke: Data Visualization
- Dan Herman: Data Summary & Statistics
All of the course instructors have previous experience developing and running R workshops at a variety of venues, and this course is a product of those experiences. The workshop also integrates content, best practices, and lessons from a variety of educators in the R community. We would like to specifically acknowledge:
- Data Science in the Tidyverse, an RStudio course with materials posted online
- MSACL Data Science 201, a course produced by Patrick Mathias and several collaborators, presented at the Mass Spectrometry: Applications to the Clinical Lab meeting.
- Stephan Kadauke's R workshop for Pathology trainees and faculty, developed at the Massachusetts General Hospital and the Hospital of the University of Pennsylvania
- Steve Master and Dan Holmes's AACC Introduction to R Workshop
- R for Data Science, the online textbook by Garrett Grolemund and Hadley Wickham, is invaluable in navigating the tidyverse and learning R in general
- Blog posts and documentation by Jenny Bryan, which helped steer the project content as well as some discussion about packages
- Amy Willis' Advanced R Course repository, a resource for understanding content in a longer, advanced R course
- Keith Baggerly and Karl Broman's Reproducible Research module at the Summer Institute in Statistics for Big Data - a big thank you to Keith Baggerly for all of his input and guidance!
- Greg Wilson's Teaching Tech Together, which offers practical advice about teaching programming.
- Claus Wilke's Fundamentals of Data Visualization, a compendium of Do's and Don'ts of data visualization.
- Method validation and some other content have been borrowed from the basic R course at AACC