search_index.json
[
["index.html", "edav.info/ About 0.1 Table of Contents 0.2 Contact", " edav.info/ Zach Bogart, Joyce Robbins 2018 About Students often want to know how they can excel in a course and we understand that desire. The standard answer given is usually something like: Just read the syllabus, focus on the topics discussed therein, and be able to understand their nuances. — Typical Prof This answer is often given after a quick sigh and delivered in a surprisingly condescending tone. We don’t like this answer. Our answer is to provide you with edav.info/. This site is one of the best ways to help you with this course. We hope that you find the confidence to dive in and explore this resource and its examples. This resource is specifically tailored to the Exploratory Data Analysis and Visualization course offered at Columbia University. However, anyone interested in working with data in R will benefit from perusing these pages. 0.1 Table of Contents Click on a banner to go to the desired page. If you’re wondering, here’s an explanation of what the banner colors mean. 0.1.1 Blue Pages Blue == INFO: The blue banners signal pages that contain basic information. 0.1.1.1 Introduction 0.1.1.2 R Basics 0.1.1.3 Final Project Notes 0.1.2 Green Pages Green == DOC: The green banners signal pages that contain more compact documentation. 0.1.2.1 Chart: Histogram 0.1.2.2 Chart: Boxplot 0.1.2.3 Chart: Scatterplot 0.1.2.4 Networks 0.1.3 Red Pages Red == WALK: The red banners signal pages that contain more extensive walkthroughs. 0.1.3.1 Walkthrough: Iris Example 0.1.4 Yellow Pages Yellow == REF: The yellow banners signal pages that contain simple collections of references. 0.1.4.1 Publishing with R 0.1.4.2 General Resources 0.2 Contact Zach Bogart: Website / Twitter Joyce Robbins: Columbia Profile / Website / Twitter / Github "],
["intro.html", "1 Introduction 1.1 Everything you need for EDAV 1.2 What the Banners Mean 1.3 Improving this resource 1.4 Fun Stuff: T-Shirts", " 1 Introduction 1.1 Everything you need for EDAV This resource has everything you need and more to be successful with R, this EDAV course, and beyond. With this resource, we try to give you a curated collection of tools and references that will make it easier to learn how to work with data in R. In addition, we include sections on basic chart types/tools so you can learn by doing. There are also several walkthroughs where we work with data and discuss problems as well as some tips/tricks that will help you. We hope this resource serves you well. 1.2 What the Banners Mean The banners at the top of each page are an effort to improve your ability to navigate this resource. Each one is color-coded based on its content and has a unique icon to improve recall. There are four types: Blue == INFO: The blue banners signal pages that contain basic information. Examples of blue pages include this introduction page and the basics page, which explains how to set up R/RStudio as well as ways to get help if you need it. Blue pages are the help desk of this resource: look to them if you are lost and need to find your way. Red == WALK: The red banners signal pages that contain more extensive walkthroughs. An example of a red page is the iris walkthrough, where a well-known dataset is presented as a pretty scatterplot and steps are shown from start to finish. This page type is the most thorough: it tries to provide full documentation, explanations of design choices, and advice on best practices. It’s like going to office hours and having a great clarifying chat with a course assistant…in article form. If you would like to see a fully-worked-through example of something with a lot of guidance along the way, check out the red pages. Green == DOC: The green banners signal pages that contain more compact documentation. 
An example of a green page is the histogram page, which includes simple examples of how to create histograms, when to use them, and things to be aware of/watch out for. The green pages hold your hand much less than the red pages: they explain how to use a chart/tool using examples and simple terms. If you have an idea in mind and are just wondering how to execute it, the green pages will help fill in those gaps. Yellow == REF: The yellow banners signal pages that contain simple collections of references. An example of a yellow page is the external resources page, which is a list of materials that you can look through and learn from. Yellow pages have the least amount of hand-holding: they are collections of resources and references that will help you learn about new things. 1.3 Improving this resource Not finding what you are looking for? Think a section could be made clearer? Consider improving edav.info/ by submitting a pull request to the GitHub page. 1.4 Fun Stuff: T-Shirts Zach Bogart has made a few T-shirts available on Teespring so you can show your love for EDAV and R. Hope you enjoy. P.S. Designing a cool shirt or sticker is a great addition to your community contribution. It has to be cool, though :) EDAV Logo Shirt R Shirt Tidyverse R Shirt "],
["basics.html", "2 R Basics 2.1 Essentials Checklist 2.2 Getting Started 2.3 Getting More Specific 2.4 Getting Help", " 2 R Basics So…there is soooo much to the world of R. Textbooks, cheatsheets, exercises, and other buzzwords full of resources you could go through. There are over 12700 packages on CRAN, the network through which R code and packages are distributed. It can be overwhelming. However, bear in mind that R is being used for a lot of different things, not all of which are relevant to EDAV. To help you navigate the landscape, here we provide a collection of resources that you should be familiar with in the context of this course. This is not to say that any of these resources are prerequisites, but they will come up in the course and we want to give you places to learn about them. Since people come with a variety of backgrounds, we will try to provide the essentials as well as some resources for more advanced users. Do not feel you have to go through all of these resources, but know that they are here if/when you need them. :) 2.1 Essentials Checklist In an effort to get everyone on the same page, here is a checklist of essentials so you can get up and running with this course. It will echo/reference a lot of info said below, but we want to make sure everything mentioned is clear and understood. Okay, then. Here are the essentials, in checklist form: Download R and RStudio: This is the biggest thing to do by far. Make sure to download both R and RStudio, as mentioned in Setting up R and RStudio. Learn your way around RStudio: RStudio is powerful…if you know how to use it. Take the time to look through the DataCamp sections on the RStudio IDE so you feel comfortable (see Use RStudio Like a Pro section). Try something!: Getting comfortable with an IDE is all about practice. So while the DataCamp vids are great, don’t solely rely on them. Try things out for yourself! 
Here are some things to play around with: Create an R Script file, paste in print("Hello, World!"), and run it Create an R Markdown file and have it generate an HTML page Download a package like tidyverse or MASS Do some math in the console Learn how to get help: Make sure you are comfortable searching for answers when you get stuck. See the section on getting help below for some…help. Get the Textbook: This course uses Graphical Data Analysis with R as its textbook. Here is an Amazon Link for a physical copy and a link to the book’s website. Set up a DataCamp Account: A lot of the references and support materials discussed in edav.info/ are from DataCamp, an online collection of courses/articles on data science. Some of the sections are free, but most are behind a paywall. If you are enrolled in this course, you should receive an invitation to create an account that will allow you full access to the site. 2.2 Getting Started 2.2.1 Setting up R and RStudio It is super important to get up and running with R and RStudio as soon as you can. This video from DataCamp pretty much covers it. Know that you will be downloading two separate things: R, which is a programming language; and RStudio, which is an IDE (integrated development environment…fancy tool for working with R) that will make working with R a lot more enjoyable. 2.2.2 Use RStudio Like a Pro Great! RStudio is up and running on your computer! Now make sure you get comfy with what it can do. Don’t know your way around the RStudio IDE? I highly recommend this DataCamp course. Sections from Part 1 (Orientation, Programming, and Projects) are the most relevant for this course. They include videos about all the regions in RStudio, how to program efficiently/effectively in the IDE (gotta love those keyboard shortcuts), and the benefits of setting up R projects. A little hazy on that last sentence? The course will help. More Advanced: Another option is this RStudio webinar. Just want a quick reference to brush up with? 
Take a look at the RStudio Cheatsheets page. More Advanced: Want to make the RStudio IDE your own? Look into modifying the preferences. You can customize the look of the IDE like default colors and typefaces, tweak default behaviors like clearing the environment on load, and integrate a session with a git repository. If something about the IDE bugs you, chances are you can make it more to your liking. 2.2.3 Learning About R R is just like any language, programming or otherwise: you need to use it to get used to it. Just starting out in R? Check out this DataCamp course for a quick introduction. For this course, you can skim/mostly ignore matrices and lists (Parts 3 & 6). More Advanced: Want to curl up with a good book about R? We recommend R for Data Science. It jumps right in, but is quite extensive. Focus less on Part IV (Model). 2.3 Getting More Specific 2.3.1 Installing Packages A lot of the cool stuff comes from installing packages into R. How do you install packages? The main function we use is install.packages("<package_name>"), which installs from CRAN, a well-known place where packages are stored. Then, once installed, you can use packages by calling them within library(). Still confused? This DataCamp video should help explain the process. Also be sure to try the accompanying exercise to make sure you have a feel for loading a package. More Advanced: Want more info? Check out this DataCamp article on everything about installing packages in R. As well as covering the basics, this article shows you how to install packages that are not located on CRAN using devtools, as well as ways to monitor the status/health of your installed packages. 2.3.2 Tidyverse Don’t know what the tidyverse is? It’s great and we use it throughout this course. Specifically, ggplot2 and dplyr, two packages within the Tidyverse. What’s ggplot? Check out this DataCamp course. 
This course is split up into three parts and it is quite long, but it does go over pretty much everything ggplot has to offer. If you are starting out, stick with Part 1. What’s dplyr? Make friends with this DataCamp course. It goes through the main dplyr verbs: select, mutate, filter, arrange, summarise; as well as the lovely pipe operator. More Advanced: Want case studies to go through? Try this one or this one. 2.3.3 Importing Data We often will need to pull data into RStudio to work with it. “Pull data”? I’m already confused. But wait! Here’s a DataCamp course on importing data using dplyr. Note: This course explains how to import every kind of data format under the sun…all you need to be familiar with for this course (mostly) is pulling in CSV files using read_csv. So, if you are overwhelmed, just stick to the read_csv stuff. More Advanced: Importing every data format under the sun you say? I want to know how to do that. Here’s Part 1, as well as Part 2, which focuses on databases and HTTP requests. Go nuts. 2.3.4 R Markdown R Markdown is how you will be submitting assignments for this course. In general, it is a great way to communicate your findings to others. Don’t know about R Markdown? DataCamp course to the rescue! We will be using html formatting so focus on that. There is also an RStudio webinar about it. More Advanced: The R Markdown page from RStudio has lessons with extensive info. Also, more cheatsheets. More Advanced: Want to jump right in? Open a new R Markdown file (File > New File > R Markdown…), and set its Default Output Format to HTML. You will get an R Markdown template you can tinker with. Try knitting the document to see what everything does. 2.4 Getting Help via https://dev.to/rly First off…breeeeeathe. We can fix this. There are a bunch of resources out there that can help you. 2.4.1 Things to Try Remember: Always try to help yourself! This article has a great list of tools to help you learn about anything you may be confused by. 
This includes learning about functions and packages as well as searching for info about a function/package/problem/etc. This is the perfect place to learn how to get the info you need. The RStudio Help menu (in the top toolbar) is a fantastic place to go for understanding/fixing any problems. There are links to documentation and manuals as well as cheatsheets and a lovely collection of keyboard shortcuts. Vignettes are a great way to learn about packages and how they work. Vignettes are like stylized manuals that can do a better job at explaining a package’s contents. For example, ggplot2 has a vignette on aesthetics called ggplot2-specs that talks about different ways you can map data to different formats. Typing browseVignettes() in the console will show you all the vignettes for all of the packages you have installed. You can also see vignettes by package by typing vignette(package = "<package_name>") into the console. To run a specific vignette, use vignette("<vignette_name>"). If the vignette can’t be resolved, include the package name as well: vignette("<vignette_name>", package = "<package_name>") Don’t ignore errors. They are telling you so much! If you give up because red text showed up in your console, take the time to see what that red text is saying. Learn how to read errors and what they are telling you. They usually include where the problem happened and what R thinks the problem stems from. More Advanced: Learn to love debugger mode. Debugging can have a steep learning curve, but huge payoffs. Take a look at these videos about debugging with R. Topics include running the debugger, setting breakpoints, customizing preferences, and more. Note: R Markdown files have some limitations for debugging, as discussed in this article. You could also consider working out your code in a .R file before including it in your R Markdown homework submission. 2.4.2 Help Me, R Community! Relax. There are a bunch of people using the same tools you are. 
Your fellow classmates are a good place to start! Post questions to Piazza to see how they could help. There is a lot of great documentation on R and its functions/packages/etc. Get comfy with R Documentation and it will help you immensely. More Advanced: There is a vibrant RStudio Community page. Also, R likes Twitter. Check out #rstats or maybe let Hadley Wickham know about a wonky error message. "],
["project.html", "3 Final Project Notes 3.1 Overview 3.2 General Info 3.3 Outline 3.4 FAQ 3.5 Rubric 3.6 Executive Summary Notes", " 3 Final Project Notes 3.1 Overview This section goes over what’s expected for the final project. General Note: Please note that this sheet cannot possibly cover all the “do’s and don’ts” of data analysis and visualization. You are expected to follow all of the best practices discussed in class throughout the semester. 3.2 General Info 3.2.1 Goal The goal of this project is to perform an exploratory data analysis / create visualizations with data of your choosing in order to gain preliminary insights on questions of interest to you. 3.2.2 Teams You must work in teams of 2-4 people. (If you have specific interests you should try to find partners on Piazza first as we will not be able to match on specific criteria – we will simply assign groups in the order in which responses come in.) 3.2.3 Topics Start with a topic / question that interests you and then look for data! 3.2.4 Data Choose data from the original source: that is, one that is not included in R (or similar), nor used in Kaggle (or similar) competitions, nor relatively well-known/well-trodden through some other forum. If in doubt, ask! A few examples are: NYC Open Data US Bureau of Labor Statistics 3.2.5 Analysis You have a lot of freedom to choose what to do, as long as you restrict yourselves to exploratory techniques (rather than modeling / prediction approaches). In addition, your analysis must be clearly documented and reproducible. 3.2.6 Feedback At any point, you may ask the head TA or the instructor (jtr13) for advice. Our primary role in this regard will be to provide general guidance on your choice of data / topic / direction. As always, you are encouraged to post specific questions to Piazza, particularly coding questions and issues. 
You may also volunteer to discuss your project with the class in order to get feedback–if you’d like to do this, email the instructor to schedule a date. 3.2.7 Peer Review A portion of your grade is based on the feedback you give to other groups. After the due date, each individual will be assigned two project groups to review, and instructions will be provided. Note: part of the grade you receive for the class is based on the quality of review that you write, not on the feedback that your project receives. Your grade for the project (as for all other assignments for the class) will be determined solely by the instructor and TAs. 3.2.8 Report Format With the exception of the interactive part, your project should be submitted to CourseWorks in the same manner as homework assignments, as .Rmd and .html files, with graphs / output rendered. You will lose points if we have trouble reading your file, need to ask you to resubmit with graphs visible, if links are broken, or if we have other difficulties accessing your materials. It’s ok if code is in different files and different places, just make sure there are working links in your report to these locations. Note: Using Markdown + code chunks is supposed to make combining code, text and graphs easier. If it is making it more difficult, you are probably trying to do something that isn’t well suited to the tool set. Focus on the text and graphs, not the formatting. If you’re not sure if something is important to focus on or not, please ask. Advice: don’t wait to start writing. Your overall project will undoubtedly be better if you give up trying to get that last graph perfect or the last bit of analysis done and get to the writing! 3.2.9 A Note on Style You are encouraged to be as intellectually honest as possible. That means pointing out flaws in your work, detailing obstacles, disagreements, decision points, etc. – the kinds of “behind-the-scene” things that are important but often left out of reports. 
You may use the first person (“I”/“We”) or specific team members’ names, as relevant. 3.3 Outline Your report should include the following sections, with subtitles (“Introduction”, etc.) as indicated: 3.3.1 Introduction Explain why you chose this topic, and the questions you are interested in studying. List team members and a description of how each contributed to the project. 3.3.2 Description of Data Describe how the data was collected, how you accessed it, and any other noteworthy features. 3.3.3 Analysis of Data Quality Provide a detailed, well-organized description of data quality, including textual description, graphs, and code. 3.3.4 Main Analysis (Exploratory Data Analysis) Provide a detailed, well-organized description of your findings, including textual description, graphs, and code. Your focus should be on both the results and the process. Include, as reasonable and relevant, approaches that didn’t work, challenges, the data cleaning process, etc. 3.3.5 Executive Summary (Presentation-style) Provide a short nontechnical summary of the most revealing findings of your analysis written for a nontechnical audience. The length should be approximately two pages (if we were using pages…) Take extra care to clean up your graphs, ensuring that best practices for presentation are followed. Note: “Presentation” here refers to the style of graph, that is, graphs that are cleaned up for presentation, as opposed to the rough ones we often use for exploratory data analysis. You do not have to present your work to the class! However, you may choose to present your work as your community contribution, in which case you need to email me to set a date before the community contribution due date. (The presentation itself may be later.) 3.3.6 Interactive Component Select one (or more) of your key findings to present in an interactive format. 
Be selective in the choices that you present to the user; the idea is that in 5-10 minutes, users should have a good sense of the question(s) that you are interested in and the trends you’ve identified in the data. In other words, they should understand the value of the analysis, be it business value, scientific value, general knowledge, etc. Interactive graphs must follow all of the best practices as with static graphs in terms of perception, labeling, accuracy, etc. You may choose the tool (D3, Shiny, or other) The complexity of your tool will be taken into account: we expect more complexity from a higher-level tool like Shiny than a lower-level tool like D3, which requires you to build a lot from scratch. Make sure that the user is clear on what the tool does and how to use it. Publish your graph somewhere on the web and provide a link in your report in the interactive section. The obvious choices are blockbuilder.org to create a block for D3, and shinyapps.io for Shiny apps but other options are fine. You are encouraged to share experiences on Piazza to help classmates with the publishing process. As applicable, all of the following will be considered in the grading process: Choice of data and plot types to present Clear relevance to question(s), project in general Design of interactive component(s) Clarity of presentation, including instructions Technical execution (include a description of what you would work on in the future, what you’ve attempted, etc. so we know it’s on your radar) 3.3.7 Conclusion Discuss limitations and future directions, lessons learned. 3.4 FAQ How long should the project be? It should take the reader approximately 15-20 minutes to read the report. We cannot provide a specific number of graphs or pages since there are so many variables. Use your judgment to cover all of the important material without being repetitive. 
You can report on what you’ve done without including all of the graphs; for example, if you looked at maps of each of the fifty states you can include 1 or 2 as examples. Do we have to present the project to the class? No. Presenting your project as your community contribution is optional. Someone has already used the same data, is that ok? Yes. As long as you get the data from the original source, not a site like Kaggle, you’re fine. You can check with the professor if you want to be sure. I spent 30 minutes looking at my data, and then 1000 hours building this super cool interactive app so users can analyze the data themselves. Can’t you count the interactive part for 95% of my grade? No. While skill sets overlap in the real world, and it’s important to know something about building things, the assumption is that you are doing the work of the data scientist: actually analyzing the data rather than building tools for someone else to do it. The former (the data!) has been the main focus of this class and therefore is the primary focus of the final project. 3.5 Rubric Introduction (including choice of data, questions), Team description 10 points Description of Data 10 points Graphical Analysis of Data Quality 10 points Main Analysis (focus on quality of EDA choices / techniques) 20 points Executive Summary (focus on quality of presentation choices / techniques) 20 points Interactive Component 20 points Conclusion 10 points TOTAL = 100 points Points will be deducted for technical flaws (problems opening files, following links, etc.), for not citing sources, and for lack of reproducibility. Late Submissions: 10 points will be deducted per day. Plagiarism of any kind will not be tolerated and will result in a grade of 0 for the project. 3.6 Executive Summary Notes The executive summary should be a well-formatted, presentable final product of your results. 
Here are some notes to consider when putting it together: Title, axis labels, tick mark labels, and legends should be comprehensible (easy to understand) and legible (easy to read / decipher). Tick marks should not be labeled in scientific notation or with long strings of zeros, such as 3000000000. Instead, convert to smaller numbers and change the units: 3000000000 becomes “3” and the axis label “billions of views”. Units should be intuitive (An axis labeled in month/day/year format is intuitive; one labeled in seconds since January 1, 1970 is not.) The font size should be large enough to read clearly. The default in ggplot2 is generally too small. You can easily change it by passing the base font size to the theme, such as + theme_grey(16) (The default base font size is 11). The order of items on the axes and legends should be logical. (Alphabetical is usually not the best option.) Colors should be color-vision-deficiency-friendly. If categorical variable levels are long, set up the graph so the categorical variable is on the y-axis and the names are horizontal. A better option, if possible, is to shorten the names of the levels. Not all EDA graphs lend themselves to presentation, either because the graph form is hard to understand without practice or it’s not well labeled. The labeling problem can be solved by adding text in an image editor. The downside is that it is not reproducible. If you want to go this route, for the Mac, Keynote and Paintbrush are good, free options. Err on the side of simplicity. Don’t, for example, overuse color when it’s not necessary. Ask yourself: does color make this graph any clearer? If it doesn’t, leave it out. Test your graphs on nontechnical friends and family and ask for feedback. Above all, have fun with it :) "],
["histo.html", "4 Chart: Histogram 4.1 Overview 4.2 tl;dr 4.3 Simple Examples 4.4 When to use 4.5 Considerations 4.6 Theory 4.7 External Resources", " 4 Chart: Histogram 4.1 Overview This section covers how to make histograms. 4.2 tl;dr Gimme a full-fledged example! Here’s an application of histograms that looks at how the beaks of Galapagos finches changed due to external factors: And here’s the code: library(Sleuth3) # data library(ggplot2) # plotting # load data finches <- Sleuth3::case0201 # finch histograms by year with overlaid density curves ggplot(finches, aes(x = Depth, y = ..density..)) + # plotting geom_histogram(bins = 20, colour = "#80593D", fill = "#9FC29F", boundary = 0) + geom_density(color = "#3D6480") + facet_wrap(~Year) + # formatting ggtitle("Severe Drought Led to Finches with Bigger Chompers", subtitle = "Beak Depth Density of Galapagos Finches by Year") + labs(x = "Beak Depth (mm)", caption = "Source: Sleuth3::case0201") + theme(plot.title = element_text(face = "bold")) + theme(plot.subtitle = element_text(face = "bold", color = "grey35")) + theme(plot.caption = element_text(color = "grey68")) For more info on this dataset, type ?Sleuth3::case0201 into the console. 4.3 Simple Examples Whoa whoa whoa! Much simpler please! Let’s use a very simple dataset: # store data x <- c(50, 51, 53, 55, 56, 60, 65, 65, 68) 4.3.1 Histogram using Base R # plot data hist(x, col = "lightblue", main = "Base R Histogram of x") The advantage of the Base R histogram is how easy it is to set up. In truth, all you need to plot the data x in question is hist(x), but we included a little color and a title to make it more presentable. 
Full documentation on hist() can be found here 4.3.2 Histogram using ggplot2 # import ggplot library(ggplot2) # must store data as dataframe df <- data.frame(x) # plot data ggplot(df, aes(x)) + geom_histogram(color = "grey", fill = "lightBlue", binwidth = 5, center = 52.5) + ggtitle("ggplot2 histogram of x") The ggplot version is a little more complicated on the surface, but you get more power and control as a result. Note: as shown above, ggplot expects a dataframe, so if you are getting an error where “R doesn’t know what to do” like this: ggplot dataframe error make sure you are using a dataframe. 4.4 When to use Use a histogram to show the distribution of one continuous variable. The y-scale can be represented in a variety of ways to express different results: Count: Number of points that fall in each bin Relative frequency: (Count) / (Total Number of datapoints) Cumulative Frequency: Accumulation of all previous Relative frequencies Density: (Relative Frequency) / (binwidth) 4.5 Considerations 4.5.1 Bin Boundaries Be mindful of the boundaries of the bins and whether a point will fall into the left or right bin if it is on a boundary. # format layout op <- par(mfrow = c(1, 2), las = 1) # right closed hist(x, col = "lightblue", ylim = c(0, 4), xlab = "right closed ex. (55, 60]", font.lab = 2) # right open hist(x, col = "lightblue", right = FALSE, ylim = c(0, 4), xlab = "right open ex. [55, 60)", font.lab = 2) 4.5.2 Bin Number The default bin number of 30 in ggplot2 is not always ideal, so consider altering it if things are looking strange. You can specify the width explicitly with binwidth or provide the desired number of bins with bins. # default...note the pop-up about default bin number ggplot(finches, aes(x = Depth)) + geom_histogram() + ggtitle("Default with pop-up about bin number") ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. 
Here are examples of changing the bins using the two ways described above: # using binwidth p1 <- ggplot(finches, aes(x = Depth)) + geom_histogram(binwidth = 0.5, boundary = 6) + ggtitle("Changed binwidth value") # using bins p2 <- ggplot(finches, aes(x = Depth)) + geom_histogram(bins = 48, boundary = 6) + ggtitle("Changed bins value") # format plot layout library(gridExtra) grid.arrange(p1, p2, ncol = 2) 4.5.3 Bin Alignment Make sure the axes reflect the true boundaries of the histogram. You can use boundary to specify the endpoint of any bin or center to specify the center of any bin. ggplot2 will be able to calculate where to place the rest of the bins. (Also, notice that when the boundary was changed, the number of bins got smaller by one. This is because by default the bins are centered and go over/under the range of the data.) df <- data.frame(x) # default alignment ggplot(df, aes(x)) + geom_histogram(binwidth = 5, fill = "lightBlue", col = "black") + ggtitle("Default Bin Alignment") # specify alignment with boundary p3 <- ggplot(df, aes(x)) + geom_histogram(binwidth = 5, boundary = 60, fill = "lightBlue", col = "black") + ggtitle("Bin Alignment Using boundary") # specify alignment with center p4 <- ggplot(df, aes(x)) + geom_histogram(binwidth = 5, center = 67.5, fill = "lightBlue", col = "black") + ggtitle("Bin Alignment Using center") # format layout library(gridExtra) grid.arrange(p3, p4, ncol = 2) Note: Don’t use both boundary and center for bin alignment. Just pick one. 4.6 Theory For more info about histograms and continuous variables, check out Chapter 3 of the textbook. 4.7 External Resources DataCamp ggplot2 Histograms Exercise: Simple interactive example of histograms with ggplot2 DataCamp Histogram with Basic R: “Tutorial for new R users whom need an accessible and easy-to-understand resource on how to create their own histogram with basic R.” ’Nuff said. DataCamp Histogram with ggplot2: Great article on making histograms with ggplot2. 
hist documentation: base R histogram documentation page. ggplot2 cheatsheet: Always good to have close by. "],
["box.html", "5 Chart: Boxplot 5.1 Overview 5.2 tl;dr 5.3 Simple Examples 5.4 When to use 5.5 Considerations 5.6 Theory 5.7 External Resources", " 5 Chart: Boxplot 5.1 Overview This section covers how to make boxplots. 5.2 tl;dr I want a nice example and I want it NOW! Here’s a look at the weights of six-week-old chicks split by the feed supplement they received: And here’s the code: library(datasets) # data library(ggplot2) # plotting # reorder supplements supps <- c("horsebean", "linseed", "soybean", "meatmeal", "sunflower", "casein") # boxplot by feed supplement with jitter layer ggplot(chickwts, aes(x = factor(feed, levels = supps), y = weight)) + # plotting geom_boxplot(fill = "#cc9a38", color = "#473e2c") + geom_jitter(alpha = 0.2, width = 0.1, color = "#926d25") + # formatting ggtitle("Casein Makes You Fat?!", subtitle = "Boxplots of Chick Weights by Feed Supplement") + labs(x = "Feed Supplement", y = "Chick Weight (g)", caption = "Source: datasets::chickwts") + theme(plot.title = element_text(face = "bold")) + theme(plot.subtitle = element_text(face = "bold", color = "grey35")) + theme(plot.caption = element_text(color = "grey68")) For more info on this dataset, type ?datasets::chickwts into the console. 5.3 Simple Examples Okay…much simpler please. Let’s use the airquality dataset from the datasets package: library(datasets) head(airquality, n = 5) ## Ozone Solar.R Wind Temp Month Day ## 1 41 190 7.4 67 5 1 ## 2 36 118 8.0 72 5 2 ## 3 12 149 12.6 74 5 3 ## 4 18 313 11.5 62 5 4 ## 5 NA NA 14.3 56 5 5 5.3.1 Boxplot using Base R # plot data boxplot(airquality, col = 'lightBlue', main = "Base R Boxplots of airquality") Boxplots with Base R are super easy. Like histograms, boxplots only need the data. In this case, we passed a dataframe with six variables, so it made separate boxplots for each variable. You may not want to create boxplots for every variable, in which case you could specify the variables individually or use select from the dplyr package. 
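Here is a hedged sketch of specifying the variables individually. The columns Ozone and Temp are picked only for illustration, and base R indexing is used so the snippet needs no extra packages.

```r
library(datasets)

# keep only the columns we want to compare (illustrative choice)
aq_subset <- airquality[, c('Ozone', 'Temp')]

# plot = FALSE returns the boxplot statistics without drawing;
# drop that argument to see the chart
bp <- boxplot(aq_subset, plot = FALSE)

# one boxplot per remaining column
bp$names
```

With dplyr loaded, select(airquality, Ozone, Temp) should give the same two columns.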
5.3.2 Boxplot using ggplot2 # import ggplot library(ggplot2) # plot data g1 <- ggplot(stack(airquality), aes(x = ind, y = values)) + geom_boxplot(fill = "lightBlue") + # extra formatting labs(x = "") + ggtitle("ggplot2 Boxplots of airquality") g1 ## Warning: Removed 44 rows containing non-finite values (stat_boxplot). ggplot2 requires data to be mapped to the x and y aesthetics. Here we use the stack function to combine each column of the airquality dataframe. Reading the documentation for the stack function (?utils::stack), we see the new stacked dataframe has two columns: values and ind, which we use to create the boxplots. Notice: ggplot2 warns us that it is ignoring “non-finite values”, which are the NA’s in the dataset. 5.4 When to use Boxplots should be used to display continuous variables. They are particularly useful for identifying outliers and comparing different groups. Aside: Boxplots may even help you convince someone you are their outlier (If you like it when people over-explain jokes, here is why that comic is funny.). 5.5 Considerations 5.5.1 Flipping Orientation Often you want boxplots to be horizontal. Super easy to do: just tack on coord_flip(): # g1 plot from above (5.3.2) g1 + coord_flip() ## Warning: Removed 44 rows containing non-finite values (stat_boxplot). 5.5.2 NOT for categorical data Boxplots are great, but they do NOT work with categorical data. Make sure your variable is continuous before using boxplots. 
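One hedged way to check this before plotting is to look at the column types. The sketch below uses the built-in iris data purely as an example, since it has four numeric columns and one factor.

```r
library(datasets)

# which columns of iris are numeric?
numeric_cols <- sapply(iris, is.numeric)
numeric_cols

# Species is a factor, so it belongs on the grouping axis, not the value axis
species_is_factor <- is.factor(iris$Species)
species_is_factor
```

Treat this as a first check only: categories stored as integer codes (survey answers coded 1 to 5, say) will still pass is.numeric().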
Here’s an example of what not to do: library(likert) # data library(dplyr) # data manipulation # load/format data data(pisaitems) pisa <- pisaitems[1:100, 2:7] %>% dplyr::mutate_all(as.integer) %>% dplyr::filter(complete.cases(.)) # create theme theme <- theme(plot.title = element_text(face = "bold")) + theme(plot.subtitle = element_text(face = "bold", color = "grey35")) + theme(plot.caption = element_text(color = "grey68")) # create plot plot <- ggplot(stack(pisa), aes(x = ind, y = values)) + geom_boxplot(fill = "#9B3535") + ggtitle("Don't Plot Boxplots of Categorical Variables Like This", subtitle = "...seriously don't. Here, I'll make it red so it looks scary:") + labs(x = "Assessment Code", y = "Values", caption = "Source: likert::pisaitems") # bad boxplot plot + theme 5.6 Theory For more info about boxplots and continuous variables, check out Chapter 3 of the textbook. 5.7 External Resources DataCamp: Quick Exercise on Boxplots: a simple example of making boxplots from a dataset. Article on boxplots with ggplot2: An excellent collection of code examples on how to make boxplots with ggplot2. Covers layering, working with legends, faceting, formatting, and more. If you want a boxplot to look a certain way, this article will help. Boxplots with plotly package: boxplot examples using the plotly package. These allow for a little interactivity on hover, which might better explain the underlying statistics of your plot. ggplot2 Boxplot: Quick Start Guide: Article from STHDA on making boxplots using ggplot2. Excellent starting point for getting immediate results and custom formatting. ggplot2 cheatsheet: Always good to have close by. "],
["scatter.html", "6 Chart: Scatterplot 6.1 Overview 6.2 tl;dr 6.3 Simple Examples 6.4 When to use 6.5 Considerations 6.6 Modifications 6.7 Theory 6.8 External Resources", " 6 Chart: Scatterplot 6.1 Overview This section covers how to make scatterplots 6.2 tl;dr Fancy Example NOW! Gimme Gimme GIMME! Here’s a look at the relationship between brain weight vs. body weight for 62 species of land mammals: And here’s the code: library(MASS) # data library(ggplot2) # plotting # ratio for color choices ratio <- mammals$brain / (mammals$body*1000) ggplot(mammals, aes(x = body, y = brain)) + # plot points, group by color geom_point(aes(fill = ifelse(ratio >= 0.02, "#0000ff", ifelse(ratio >= 0.01 & ratio < 0.02, "#00ff00", ifelse(ratio >= 0.005 & ratio < 0.01, "#00ffff", ifelse(ratio >= 0.001 & ratio < 0.005, "#ffff00", "#ffffff"))))), col = "#656565", alpha = 0.5, size = 4, shape = 21) + # add chosen text annotations geom_text(aes(label = ifelse(row.names(mammals) %in% c("Mouse", "Human", "Asian elephant", "Chimpanzee", "Owl monkey", "Ground squirrel"), paste(as.character(row.names(mammals)), "→", sep = " "),'')), hjust = 1.12, vjust = 0.3, col = "grey35") + geom_text(aes(label = ifelse(row.names(mammals) %in% c("Golden hamster", "Kangaroo", "Water opossum", "Cow"), paste("←", as.character(row.names(mammals)), sep = " "),'')), hjust = -0.12, vjust = 0.35, col = "grey35") + # customize legend/color palette scale_fill_manual(name = "Brain Weight, as the\\n% of Body Weight", values = c('#d7191c','#fdae61','#ffffbf','#abd9e9','#2c7bb6'), breaks = c("#0000ff", "#00ff00", "#00ffff", "#ffff00", "#ffffff"), labels = c("Greater than 2%", "Between 1%-2%", "Between 0.5%-1%", "Between 0.1%-0.5%", "Less than 0.1%")) + # formatting scale_x_log10(name = "Body Weight", breaks = c(0.01, 1, 100, 10000), labels = c("10 g", "1 kg", "100 kg", "10K kg")) + scale_y_log10(name = "Brain Weight", breaks = c(1, 10, 100, 1000), labels = c("1 g", "10 g", "100 g", "1 kg")) + ggtitle("An Elephant Never 
Forgets...How Big A Brain It Has", subtitle = "Brain and Body Weights of Sixty-Two Species of Land Mammals") + labs(caption = "Source: MASS::mammals") + theme(plot.title = element_text(face = "bold")) + theme(plot.subtitle = element_text(face = "bold", color = "grey35")) + theme(plot.caption = element_text(color = "grey68")) + theme(legend.position = c(0.832, 0.21)) For more info on this dataset, type ?MASS::mammals into the console. And if you are going crazy not knowing what species is in the top right corner, it’s another elephant. Specifically, it’s the African elephant. It also never forgets how big a brain it has. :) 6.3 Simple Examples That was too fancy! Much simpler please! Let’s use the SpeedSki dataset from GDAdata to look at how the speed achieved by the participants related to their birth year: library(GDAdata) head(SpeedSki, n = 7) ## Rank Bib FIS.Code Name Year Nation Speed Sex Event ## 1 1 61 7039 ORIGONE Simone 1979 ITA 211.67 Male Speed One ## 2 2 59 7078 ORIGONE Ivan 1987 ITA 209.70 Male Speed One ## 3 3 66 190130 MONTES Bastien 1985 FRA 209.69 Male Speed One ## 4 4 57 7178 SCHROTTSHAMMER Klaus 1979 AUT 209.67 Male Speed One ## 5 5 69 510089 MAY Philippe 1970 SUI 209.19 Male Speed One ## 6 6 75 7204 BILLY Louis 1993 FRA 208.33 Male Speed One ## 7 7 67 7053 PERSSON Daniel 1975 SWE 208.03 Male Speed One ## no.of.runs ## 1 4 ## 2 4 ## 3 4 ## 4 4 ## 5 4 ## 6 4 ## 7 4 6.3.1 Scatterplot using Base R x <- SpeedSki$Year y <- SpeedSki$Speed # plot data plot(x, y, main = "Scatterplot of Speed vs. Birth Year") Base R scatterplots are easy to make. All you need are the two variables you want to plot. Although scatterplots can be made with categorical data, the variables you are plotting will usually be continuous. 
6.3.2 Scatterplot using ggplot2 library(GDAdata) # data library(ggplot2) # plotting # main plot scatter <- ggplot(SpeedSki, aes(Year, Speed)) + geom_point() # show with trimmings scatter + labs(x = "Birth Year", y = "Speed Achieved (km/hr)") + ggtitle("Ninety-One Skiers by Birth Year and Speed Achieved") ggplot2 makes it very easy to create scatterplots. Using geom_point(), you can easily plot two different aesthetics in one graph. It is also simple to add on extra formatting to make your plots look nice (all that is really necessary is the data, the aesthetics, and the geom). 6.4 When to use Scatterplots are great for exploring relationships between variables. Basically, if you are interested in how variables relate to each other, the scatterplot is a great place to start. 6.5 Considerations TODO 6.5.1 Overlapping Data Data with similar values will overlap in a scatterplot and may lead to problems. Consider exploring alpha blending or jittering as remedies (links from Overlapping Data section of Iris Walkthrough). 6.5.2 Scaling Modify the scales to make the plot more legible. 6.6 Modifications TODO Scatterplot matrices Contour Lines 6.7 Theory For more info about adding lines/contours, comparing groups, and plotting continuous variables, check out Chapter 5 of the textbook. 6.8 External Resources Quick-R article about scatterplots using Base R. Goes from the simple into the very fancy, with Matrices, High Density, and 3D versions. STHDA Base R: article on scatterplots in Base R. More examples of how to enhance the humble graph. STHDA ggplot2: article on scatterplots in ggplot2. Heavy on the formatting options available and facet wraps. Stack Overflow on adding labels to points from geom_point() ggplot2 cheatsheet: Always good to have close by. "],
["network.html", "7 Networks 7.1 visNetwork (interactive)", " 7 Networks 7.1 visNetwork (interactive) visNetwork is a powerful R implementation of the interactive JavaScript vis.js library; it uses tidyverse piping: https://datastorm-open.github.io/visNetwork/ The Vignette has clear worked-out examples: https://cran.r-project.org/web/packages/visNetwork/vignettes/Introduction-to-visNetwork.html The visNetwork documentation doesn’t provide the same level of explanation as the original, so it’s worth checking out the vis.js documentation as well: http://visjs.org/index.html The interactive examples are particularly useful for trying out different options. For example, you can test out physics options with this network configurator. 7.1.1 Minimum working example Create a node data frame with a minimum of one column (must be called id) with node names: # nodes boroughs <- data.frame(id = c("The Bronx", "Manhattan", "Queens", "Brooklyn", "Staten Island")) Create a separate data frame of edges with from and to columns. # edges connections <- data.frame(from = c("The Bronx", "The Bronx", "Queens", "Queens", "Manhattan", "Brooklyn"), to = c("Manhattan", "Queens", "Brooklyn", "Manhattan", "Brooklyn", "Staten Island")) Draw the network with visNetwork(nodes, edges) library(visNetwork) visNetwork(boroughs, connections) Add labels by adding a label column to nodes: library(dplyr) ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union boroughs <- boroughs %>% mutate(label = id) visNetwork(boroughs, connections) 7.1.2 Performance visNetwork can be very slow. %>% visPhysics(stabilization = FALSE) starts rendering before the stabilization is complete, which doesn’t actually speed things up but allows you to see what’s happening, which makes a big difference in user experience. 
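A hedged sketch of that call in context, reusing the borough data from the minimum working example. The edges_ok check and the requireNamespace() guard are additions for illustration (they are not something visNetwork asks for), and the calls are written without the pipe so nothing else needs to be attached.

```r
# nodes and edges as in the minimum working example
boroughs <- data.frame(
  id = c('The Bronx', 'Manhattan', 'Queens', 'Brooklyn', 'Staten Island'),
  stringsAsFactors = FALSE
)
connections <- data.frame(
  from = c('The Bronx', 'The Bronx', 'Queens', 'Queens', 'Manhattan', 'Brooklyn'),
  to = c('Manhattan', 'Queens', 'Brooklyn', 'Manhattan', 'Brooklyn', 'Staten Island'),
  stringsAsFactors = FALSE
)

# illustrative sanity check: every edge endpoint is a known node id
edges_ok <- all(c(connections$from, connections$to) %in% boroughs$id)

# only build the widget if the package is installed
if (edges_ok && requireNamespace('visNetwork', quietly = TRUE)) {
  net <- visNetwork::visNetwork(boroughs, connections)
  # start drawing before stabilization finishes
  net <- visNetwork::visPhysics(net, stabilization = FALSE)
}
```

Type net in the console afterwards to display the network.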
(It’s also fun to watch the network stabilize). Other performance tips are described here. 7.1.3 Helpful configuration tools %>% visConfigure(enabled = TRUE) is a useful tool for configuring options interactively. Upon completion, click “generate options” for the code to reproduce the settings. More here (Note that changing options and then viewing them requires a lot of vertical scrolling in the browser. I’m not sure if anything can be done about this. If you have a solution, let me know!) 7.1.4 Coloring nodes Add a column of actual color names to the nodes data frame: boroughs <- boroughs %>% mutate(is.island = c(FALSE, TRUE, FALSE, FALSE, TRUE)) %>% mutate(color = ifelse(is.island, "blue", "yellow")) visNetwork(boroughs, connections) 7.1.5 Directed nodes (arrows) visNetwork(boroughs, connections) %>% visEdges(arrows = "to;from", color = "green") 7.1.6 Turn off the physics simulation It’s much faster without the simulation. The nodes are randomly placed and can be moved around without affecting the rest of the network, at least in the case of small networks. visNetwork(boroughs, connections) %>% visEdges(physics = FALSE) 7.1.7 Grey out nodes far from selected (defined by “degree”) (Click a node to see effect.) # defaults to 1 degree visNetwork(boroughs, connections) %>% visOptions(highlightNearest = TRUE) # set degree to 2 visNetwork(boroughs, connections) %>% visOptions(highlightNearest = list(enabled = TRUE, degree = 2)) "],
["iris.html", "8 Walkthrough: Iris Example 8.1 Overview 8.2 Quick Note on Doing it the Lazy Way 8.3 Viewing Data 8.4 Plotting data 8.5 Markdown Etiquette 8.6 Overlapping Data 8.7 Formatting for presentation 8.8 Alter Appearance 8.9 Consider Themes 8.10 Going Deeper 8.11 Helpful links", " 8 Walkthrough: Iris Example 8.1 Overview This example goes through some work with the iris dataset to get to a finished scatterplot that is ready to present. 8.1.1 tl;dr Here’s what we end up with: library(ggplot2) base_plot <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species), size = 3, alpha = 0.5, position = "jitter") + xlab("Sepal Length (cm)") + ylab("Sepal Width (cm)") + ggtitle("Sepal Dimensions in Different Species of Iris Flowers") base_plot + theme_minimal() Wondering how we got there? Read on. 8.1.2 Packages ggplot2 dplyr stats Base datasets (gridExtra) 8.1.3 Techniques Keyboard Shortcuts Viewing Data Structure/Dimensions/etc. Accessing Documentation Plotting with ggplot2 Layered Nature of ggplot2/Grammar of Graphics Mapping aesthetics in ggplot2 Overlapping Data: alpha and jitter Presenting Graphics Themes 8.2 Quick Note on Doing it the Lazy Way Shortcuts are your best friend to get work done faster. And they are easy to find. In the toolbar: Tools > Keyboard Shortcuts Help OR ⌥⇧K Some good ones: Insert assignment operator (<-): Alt/Option+- Insert pipe (%>%): Ctrl/Cmd+Shift+M Comment Code: Ctrl/Cmd+Shift+C Run current line/selection: Ctrl/Cmd+Enter Re-run previous region: Ctrl/Cmd+Shift+P Be on the lookout for things you do often and try to see if there is a faster way to do them. Additionally, the RStudio IDE can be a little daunting, but it is full of useful tools that you can read about in this cheatsheet or go through with this DataCamp course: Part 1, Part 2. Okay, now let’s get to it… 8.3 Viewing Data Let’s start with loading the package so we can get the data as a dataframe. 
library(datasets) class(iris) ## [1] "data.frame" This is not a huge dataset, but it is helpful to get into the habit of treating datasets as large no matter what. Because of this, make sure you inspect the size and structure of your dataset before going and printing it to the console. Here we can see that we have 150 observations across 5 different variables. dim(iris) ## [1] 150 5 There are a bunch of ways to get information on your dataset. Here are a few: str(iris) ## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ... summary(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## Species ## setosa :50 ## versicolor:50 ## virginica :50 ## ## ## # This one requires dplyr, but it's worth it :) library(dplyr) glimpse(iris) ## Observations: 150 ## Variables: 5 ## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,... ## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,... ## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,... ## $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,... ## $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, s... Typing iris into the console will print the whole thing. Go ahead and try it in this case, but this is not recommended for larger datasets. 
Instead, use head() in the console or View(). If you want to learn more about these commands, or anything for that matter, just type ?<command> into the console. ?head, for example, will reveal that there is an additional argument to head called n for the number of lines printed, which defaults to 6. Also, you may notice there is something called tail. I wonder what that does? :) 8.4 Plotting data Let’s plot something! # Something's missing library(ggplot2) ggplot(iris) Where is it? Maybe if we add some aesthetics. I remember that was an important word that came up somewhere: # Still not working... ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) Still nothing. Remember, you have to add a geom for something to show up. # There we go! ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() Yay! Something showed up! Notice where we put the data, inside of ggplot(). ggplot is built on layers. Here we put it in the main call to ggplot. The data argument is also available in geom_point(), but in that case it would only apply to that layer. Here, we are saying, for all layers, unless specified, make the data be iris. Now let’s add a color mapping by Species: ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species)) Usually it is helpful to store the main portion of the plot in a variable and add on the layers. The code below achieves the same output as above: sepal_plot <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) sepal_plot + geom_point(aes(color = Species)) 8.5 Markdown Etiquette I’m seeing that my R Markdown file is getting a little messy. Working with markdown and chunks can get out of hand, but there are some helpful tricks. First, consider naming your chunks as you go. If you combine this with headers, your work will be much more organized. Specifically, the little line at the bottom of the editor becomes much more useful. 
From this: To this: Just add a name to the start of each chunk: {r <cool-code-chunk-name>, <chunk_option> = TRUE} Now you can see what the chunks were about as well as get a sense of where you are in the document. Just don’t forget, it is a space after the r and commas for the other chunk options you may have like eval or echo. 8.6 Overlapping Data Eagle-eyed viewers may notice that we seem to be a few points short. We should be seeing 150 points, but we only see 117 (yes, I counted). Where are those 33 missing points? They are actually hiding behind other points. This dataset rounds to the nearest tenth of a centimeter, which is what is giving us those regular placings of the points. How did I know the data was in centimeters? Running ?iris in the console of course! Ah, you ask a silly question, you get a silly answer. # This plot hides some of the points ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species)) What’s the culprit? The color aesthetic. The color by default is opaque and will hide any points that are behind it. As a rule, it is always beneficial to reduce the opacity a little no matter what to avoid this problem. To do this, change the alpha value to something other than its default of 1, like 0.5. ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species, alpha = 0.5)) Okay…a couple things with this. 8.6.1 First: The Legend First, did you notice the new addition to the legend? That looks silly! Why did that show up? Well, when we added the alpha into aes(), we got a new legend. Let’s look at what we are doing with geom_point(). Specifically, this is saying how we should map the color and alpha: geom_point(mapping = aes(color = Species, alpha = 0.5)) So, we are mapping these given aesthetics, color and alpha, to certain values. 
ggplot knows that usually the aesthetic mapping will vary since you are probably passing in data that varies, so it will create a legend for each mapping. However, we don’t need a legend for the alpha: we explicitly set it to be 0.5. To fix this, we can pull alpha out of aes and instead treat it like an attribute: ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species), alpha = 0.5) No more legend. So, in ggplot, it matters where an aesthetic is placed. This is the difference between MAPPING an aesthetic (making it vary with data inside aes) and SETTING an aesthetic (making it a constant attribute across all datapoints outside of aes). 8.6.2 Second: Jittering Secondly, did this alpha trick really help us? Are we able to see anything in the plot in an easier way? Not really. Since the points perfectly overlap, the opacity difference doesn’t help us much. Usually, opacity will work, but here the data is so regular that we don’t gain anything in the perception department. We can fix this by introducing some jitter to the datapoints. Jitter adds a little random noise and moves the datapoints so that they don’t fully overlap: ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species), alpha = 0.5, position = "jitter") Consider your motives when using jittering. You are by definition altering the data, but it may be beneficial in some situations. 8.6.3 Aside: Example Where Alpha Blending Works We are dealing with a case where jittering works best to see the data, while changing the alpha doesn’t help us much. Here’s a quick example where opacity using alpha might be more directly helpful. 
# lib for arranging plots side by side library(gridExtra) # make some normally distributed data x_points <- rnorm(n = 10000, mean = 0, sd = 2) y_points <- rnorm(n = 10000, mean = 6, sd = 2) df <- data.frame(x_points, y_points) # plot with/without changed alpha plt1 <- ggplot(df, aes(x_points, y_points)) + geom_point() + ggtitle("Before (alpha = 1)") plt2 <- ggplot(df, aes(x_points, y_points)) + geom_point(alpha = 0.1) + ggtitle("After (alpha = 0.1)") # arrange plots gridExtra::grid.arrange(plt1, plt2, ncol = 2, nrow = 1) Here it is much easier to see where the dataset is concentrated. 8.7 Formatting for presentation Let’s say we have finished this plot and we are ready to present it to other people: We should clean it up a bit so it can stand on its own. 8.8 Alter Appearance First, let’s make the x/y labels a little cleaner and more descriptive: ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species), alpha = 0.5, position = "jitter") + xlab("Sepal Length (cm)") + ylab("Sepal Width (cm)") Next, add a title that encapsulates the plot: ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species), alpha = 0.5, position = "jitter") + xlab("Sepal Length (cm)") + ylab("Sepal Width (cm)") + ggtitle("Sepal Dimensions in Different Species of Iris Flowers") And make the points a little bigger: ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species), size = 3, alpha = 0.5, position = "jitter") + xlab("Sepal Length (cm)") + ylab("Sepal Width (cm)") + ggtitle("Sepal Dimensions in Different Species of Iris Flowers") Now it’s looking presentable. 8.9 Consider Themes It may be better for your situation to change the theme of the plot (the background, axes, etc.; the “accessories” of the plot). Explore what different themes can offer and pick one that is right for you. 
base_plot <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species), size = 3, alpha = 0.5, position = "jitter") + xlab("Sepal Length (cm)") + ylab("Sepal Width (cm)") + ggtitle("Sepal Dimensions in Different Species of Iris Flowers") base_plot base_plot + theme_light() base_plot + theme_minimal() base_plot + theme_classic() base_plot + theme_void() I’m going to go with theme_minimal() this time. So here we are! We got a lovely scatterplot ready to show the world! 8.10 Going Deeper We have just touched the surface of ggplot and dipped our toes into grammar of graphics. If you want to go deeper, I highly recommend the DataCamp (DataCamp 2018) courses on Data Visualization with ggplot2 with Rick Scavetta. There are three parts and they are quite dense, but the first part is definitely worth checking out. 8.11 Helpful links RStudio ggplot2 Cheat Sheet DataCamp: Mapping aesthetics to things in ggplot R Markdown Reference Guide R for Data Science References "],
["publish.html", "9 Publishing with R 9.1 Overview 9.2 Bookdown 9.3 Essentials 9.4 Trimmings", " 9 Publishing with R 9.1 Overview This section discusses how we built edav.info/ and includes references for building sites/books of your own using R. 9.2 Bookdown edav.info/ is built using Bookdown, “a free and open-source R package built on top of R Markdown to make it really easy to write books and long-form articles/reports.” The biggest selling-point for bookdown is that it allows you to make content that is both professional and adaptable. If you want to update a regular book, you need to issue another edition and go through a lot of hassle. With bookdown, you can publish it in different formats (including print, if desired) and be able to change things easily when needed. We chose bookdown for edav.info/ because it allows us to present a lot of content in a compact, searchable manner, while also letting students suggest updates and contribute to its structure. Again, it is professional and adaptable (The default bookdown output is essentially just an online book, but we tried to liven it up by adding a lot of helpful icons, logos, and banners to improve navigation). Below are some helpful references we used in creating edav.info/, which may be helpful if you are interested in creating your own website or online resource with R. 9.3 Essentials How to Start a Bookdown Book: The hardest part about bookdown is getting it up and running. Sean Kross has the best template instructions we found. We started this project by cloning his template repo and building off of it. Excellent descriptions on what all the files do and what is essential to start your project. bookdown: Authoring Books and Technical Documents with R Markdown: This textbook by Yihui Xie, author of the bookdown package, explains everything bookdown is able to accomplish (published using bookdown…because of course it is). An incredibly informative reference which we always kept close by. 
Author’s blurb: A guide to authoring books with R Markdown, including how to generate figures and tables, and insert cross-references, citations, HTML widgets, and Shiny apps in R Markdown. RStudio Bookdown Talk: Yihui Xie (author of the bookdown package) discusses his package and what it can do in a one-hour talk. Good for seeing finished examples. bookdown.org: Site for the bookdown package. Has a bunch of popular books published using bookdown and some info about how to get started using the package. Creating Websites in R: This tutorial, written by Emily Zabor (a Columbia alum), provides a thorough walkthrough for creating websites using different R tools. She discusses how to make different kinds of sites (personal, package, project, blog) as well as GitHub integration and step-by-step instructions for getting set up with templates and hosting. Very detailed and worth perusing if interested in making your own site. 9.4 Trimmings Custom Domain Names: GitHub integration with custom domain names is easy to set up. GitHub has an article on how to set up a custom domain with GitHub Pages that will help to get your desired URL hooked up (custom domain names: the vanity plates of the internet). GitHub Pages supports free hosting, which makes the whole process a lot easier. Also, if you are in the market for a cool domain name, Google Domains is a great place to get the one of your dreams. Custom 404 Page: Your site may be lovely, but a default 404 page is always a letdown. Not if but when someone types part of your URL incorrectly or a link gets broken, you should make sure there is something to see other than a boring backend page you had no input in designing. This article explains the process, but all you have to do is make a file called 404.html in your root directory and GitHub will use it rather than the default. Because of this, there is really no excuse for not having one. Here’s a look at our 404 page. Hopefully you aren’t seeing it that often. 
:) Some considerations: Always include a link back to the site: Throw the user a life-saver. Make it clear that something went wrong: Don’t hide the fact that this page is because of some error. Other than that, have fun with it!: There are plenty of examples of people making excellent 404 pages. It should make a frustrating experience just a little bit more bearable. "],
["general.html", "10 General Resources 10.1 Books 10.2 Cheatsheets", " 10 General Resources This is a long list of helpful general resources related to EDAV. 10.1 Books A lot of these are available for students through Columbia Libraries, in both physical and e-book formats. 10.1.1 Graphical Data Analysis with R This book systematically goes through the different types of data, including categorical variables, continuous variables, and time series. The author shows different examples of plotting techniques using ggplot and promoting the “grammar of graphics” model. Code snippets included and available at the book’s website. 10.1.2 R for Data Science The classic. Everything from data types, programming, modeling, communicating, and those keyboard shortcuts you keep forgetting. To quote the book, “this book will teach you how to do data science with R.” Nuff said. 10.2 Cheatsheets 10.2.1 Cheatsheet of cheatsheets Paul van der Laken has put together a large collection of R resource links, including cheat sheets, style guides, package info, blogs, and other helpful resources. 10.2.2 RStudio Cheatsheet Collection Collection of downloadable cheatsheets from RStudio. Includes ones on R Markdown, Data Transformation (dplyr), and Data Visualization (ggplot2). They also have a R Markdown Reference Guide, which is great for remembering that one chunk option that’s on the tip of your tongue. "]
]