In [1]:
# Uncomment these lines if necessary!
# install.packages("testthat")
# install.packages("IRdisplay")
# install.packages("tidyverse")
# install.packages("tidymodels")

In [2]:
library(testthat)
library(IRdisplay)
library(tidyverse)
library(tidymodels)

"package 'testthat' was built under R version 4.0.3"
"package 'IRdisplay' was built under R version 4.0.3"
"package 'tidyverse' was built under R version 4.0.3"
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --

v ggplot2 3.3.2     v purrr   0.3.4
v tibble  3.0.3     v dplyr   1.0.2
v tidyr   1.1.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.5.0

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter()  masks stats::filter()
x purrr::is_null() masks testthat::is_null()
x dplyr::lag()     masks stats::lag()
x dplyr::matches() masks tidyr::matches(), testthat::matche

<h1 style="text-align: center;">Data Visualization for People in a Hurry</h1>
<p style="text-align: center;"><i>an nwPlus workshop ✨</i></p>

Workshop slides available [here](https://docs.google.com/presentation/d/e/2PACX-1vSPf9e7YfHleOqbkfxuiwXBQNh59jhZoULyXrwL1X1TO8I9IGdlG5lFN4zAlvFEtH0CNnOM_WhpyasR/pub?start=true&loop=true&delayms=30000).

### Goals 🎯
- Produce your first visualization of data using the R programming language
- Learn about how data science can enhance your workflow in non-programming courses
- Explore data science at UBC, in the context of hackathons, and in the working world


### Links 🔗

- **GitHub repository:** Source code [here](https://github.com/michaelfromyeg/data-viz-for-people-in-a-hurry) (currently private).
- **Data set:** Available [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), from Kaggle, or [here](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)), directly from UCI. It's also included in the repo!

**Credits**

I was inspired by a few other sources for this workshop that I recommend you check out!

- [R Workshop Notes](http://tutorials.iq.harvard.edu/R/Rintro/Rintro.html), *Harvard University* — good background to R
- [Learn X in Y Minutes: R](https://learnxinyminutes.com/docs/r/), *Learn X in Y Minutes* — good refresher for syntax
- [Introduction to Data Science](https://ubc-dsci.github.io/introduction-to-datascience/index.html), *Tiffany-Anne Timbers, Trevor Campbell, Melissa Lee* — UBC's DSCI 100 course textbook

##### PART I — 1 hour

### 1—What is data science? 🤔
⏱ **10 mins**

See the [workshop slides](https://docs.google.com/presentation/d/e/2PACX-1vSPf9e7YfHleOqbkfxuiwXBQNh59jhZoULyXrwL1X1TO8I9IGdlG5lFN4zAlvFEtH0CNnOM_WhpyasR/pub?start=true&loop=true&delayms=30000) for more information.

#### 1.1—An introduction to data science

#### 1.2—A four-part process

#### 1.3—The big picture


### 2—Wranglin' data 🤠
⏱ **20 mins**

#### 2.0—R

R can sometimes be a bit tricky to read. Here's the basic syntax you need to know for today.

- Assigning a variable
- Function calls
- Parameters
- Printing to your notebook cell's output

We'll learn each of these as we go, but up front, let's get comfortable with variables.

#### Variables

In [3]:
# Variables

## Hold on to data (allow you to save a result or calculation)
## Can "change" (i.e., vary, hence variable)
## Once created, available anywhere in your program 
##   (including later blocks of code)

## To assign a variable in R, we use a fancy arrow, "<-"
## The arrow means, take the thing on the right hand side and save it
## to the variable on the left

5 + 4 # This does not save our result anywhere!
x <- 5 + 4

In [4]:
# Printing data

## To print any data in R, we can simply "put" the variable on its own line
## If you want to be more explicit about it you can also write `print(my_variable)`
## Note the syntax here and lack of spaces. We write print(...), where ... is the variable we want to print. 
## It's often said that the ... is "wrapped" by parentheses.

x # We can access x down here!
print(x) # Wrapping "x" by ()
print(x + 2)

[1] 9
[1] 11


In [5]:
# Your turn: in this cell, create a variable y that is equal to 7. 
# Then, create a variable z that is equal to the sum of x and y.
# Finally, print z

### BEGIN SOLUTION ###
y <- 7
z <- x + y
print(z)
### END SOLUTION ###

In [7]:
# These tests will tell you if you're on the right track. Once you run this cell, it should simply say "Test passed"

test_that("y is correct", {
  expect_equal(y, 7)
})
test_that("z is correct", {
  expect_equal(z, 16)
})

Test passed
Test passed


In [None]:
# Your turn: first reassign x and y to strings so that this prints 'hello world'
paste(x, y) # Should be 'hello world'

In [None]:
test_that("hello world prints", {
  expect_equal(paste(x, y), "hello world")
})

#### 2.1—Importing data

The first step is to import our data set, but we have to learn a bit more about R to do that.

In [None]:
# Functions
## This concept is very similar to math. Your classic y = f(x) is very much the same concept in programming, 
## under the same name. We use functions to map 0-to-many inputs (called parameters, or arguments) to an 
## output, often called the result or return value. 

## The general form of a function *call* (this means we're using the function, that is, producing our y value) is,
## result <- my_function(argument1, argument2, ..., argumentN) [see how this mirrors y = f(x)?]

## When working with R, we very rarely create our own function (such as, saying h(x) = (x + 3) / 2), but we do use other
## people's functions. Let's practice that!

## Your turn: call the help function with a single argument: the bare name help itself (not the string "help"!)

### BEGIN SOLUTION
### END SOLUTION

## Psst... this is how you can access help in R. Try typing in help(print) or help(paste)

In [None]:
## Your turn: call the print function twice. First, with the number 5, and second, with the string nwPlus.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
# Libraries

## To get access to more functions than the ones "built in" to R, we need to load things called libraries.
## You can think of libraries as collections of functions that make R much more powerful. 
## For today's workshop, we just need one library, called tidyverse. To import a library, we use the library(...) function;
## it accepts a library's name as the parameter (not as a string, just the actual text name).

## P.S. Sometimes, we need to install a library first before we use it. To do this, just run install.packages("...") where
## ... is the name of the desired library.

## Your turn: install the caret package and import it 

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Now, finally, let's import the needed libraries for this workshop.
library(tidyverse)

In [None]:
## With the tidyverse package, we get access to a function called `read_csv` that allows us to import data from a URL
## (https://www...) or a local file.

## Your turn: call the read_csv function with the argument "data.csv" (a string this time, because it names a
## local file). Save the result to the variable `cancer_data`

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
# Your turn: print out the data you just imported

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
display_html(HTML_STRING)
test_that("cancer data is read in correctly", {
  expect_equal(cancer_data, read_csv("data.csv"))
})

#### 2.2—Exploring the data

Woah, that's a lot of data!

Before we know how to clean up the data or visualize it, we need to understand what form the data is in. But it's really hard to do that when R spews out that much.

Thankfully, there are a few functions we can use to help manage this.
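
As a small illustration of the idea, here is a sketch on a made-up table. The name `toy_data` and its columns are hypothetical and not part of the workshop data. It shows how `head` and `tail` let you peek at either end of a table instead of printing all of it.

```r
library(tidyverse)

# A made-up table with 100 rows (hypothetical data, for illustration only)
toy_data <- tibble(id = 1:100, value = (1:100) * 2)

first_rows <- head(toy_data)  # the first few rows (6 by default)
last_rows  <- tail(toy_data)  # the last few rows (6 by default)

first_rows
last_rows
```

Both results are themselves small tables, so you can save them to variables just like any other value.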

In [None]:
## R provides two functions to visualize a "slice" of your data. They're called head and tail.
## Let's experiment with them.

## Your turn: call the function head with your data as the only argument.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: call the head function with two arguments. First, the data like before. 
## Second, try adding a number (1, 2, 5, 10). What effect does this have?

### BEGIN SOLUTION ###
### END SOLUTION ###

#### 2.3—Filtering data

Great, now we understand what our data looks like! To produce the result we want, we might want to reduce the scope of our analysis, or filter out bad rows. What if we want to count and see the total number of benign tumors? Or get only tumors where the `smoothness_mean` is over a certain value?

To solve problems like this, we need to use filter.
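
Here is a minimal sketch of what filtering looks like, using a made-up table so the exercises stay unspoiled. The table `pets` and its columns are hypothetical; only `filter` and the comparison operators come from the workshop material.

```r
library(tidyverse)

# A made-up table (hypothetical data, for illustration only)
pets <- tibble(
  name   = c("Ada", "Bo", "Cy", "Di"),
  kind   = c("cat", "dog", "cat", "dog"),
  weight = c(4.2, 12.5, 5.1, 30.0)
)

# Keep only the rows where kind is exactly "cat"
cats <- filter(pets, kind == "cat")

# Keep only the rows where weight is over 10
heavy_pets <- filter(pets, weight > 10)

cats
heavy_pets
```

The original table is never changed; `filter` hands back a new, smaller table that you can save under a new name.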

In [None]:
## Filter accepts two arguments. The first is your data, and the second is the condition. Conditions can be:
### exactly equal ==
### greater than > (or, greater than or equal to >=)
### less than < (or, less than or equal to <=)

## Conditions must produce a true or false value. Let's try working with true and false before we use filter.

## Your turn: print the result of 5 == 4. Print the result of 5 == 5. Notice anything interesting about true and false?
## (think: in terms of spelling, capitalization, or punctuation)

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: print the result of whether or not negative one hundred is greater than zero. Save that to a variable called
## is_colder

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
display_html(HTML_STRING)
test_that("is_colder is false", {
  expect_false(is_colder)
})

In [None]:
## Your turn: use filter to keep only the malignant rows of cancer_data. (What's the filtering condition?)
## Save it to a variable called malignant_rows. Print the tail of that variable.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## R gives us a very fast way to check the "size" of a table (that is, the number of rows).
## Fittingly, the function is called nrow.

## Your turn: get the number of malignant tumors in the data set and save it to a variable called num_malignant. Do 
## the same for benign.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: compute the total number of rows in the data set. Save it to a variable called total.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: print the percentage benign, and percentage malignant in the data set.
## Hint: this is just a math problem.

# Challenge: Try printing it "well formatted". That is, only to two decimal places and with a '%' symbol at the end.

### BEGIN SOLUTION ###
### END SOLUTION ###

#### 2.4—Selecting data

There are a ton of columns associated with the cancer data. Columns like these are often called variables or features. We often want to do our analysis on just a few of those columns. R has an easy way of doing this!
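
To make that concrete, here is a rough sketch on a made-up table. The table `pets` and its columns are hypothetical. It keeps a couple of columns by name, and separately drops one column with a minus sign.

```r
library(tidyverse)

# A made-up table (hypothetical data, for illustration only)
pets <- tibble(
  name   = c("Ada", "Bo", "Cy", "Di"),
  kind   = c("cat", "dog", "cat", "dog"),
  weight = c(4.2, 12.5, 5.1, 30.0),
  age    = c(3, 7, 1, 9)
)

# Keep only the name and weight columns
name_and_weight <- select(pets, name, weight)

# Keep every column except age
no_age <- select(pets, -age)

colnames(name_and_weight)
colnames(no_age)
```

Selecting changes which columns you have, while the number of rows stays the same.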

In [None]:
## The function we use is called "select". Select's first parameter must be the data set.
## Then, every parameter after must be a column we want to keep.

## for example, selected_columns <- select(original_data, column1, column2, ..., columnN)
## You can select as many, or as few, columns as you want

## Your turn: select only the diagnosis, radius_mean, and smoothness_mean columns from the (original) cancer data. Save
## it to a variable called cancer_select.

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## We can also choose to select all columns *except* one (or two, or three, etc.) We do this through the minus symbol.

## for example, selected_columns <- select(original_data, -column1, -column2, ..., -columnN)
## This removes column1, column2, ..., columnN

## Your turn: select every column except the diagnosis column and id column. Save it to a variable called anonymous_data.

### BEGIN SOLUTION ###
### END SOLUTION ###

#### tl;dr

This entire process of selecting, filtering, and modifying data is generally referred to as "wrangling." This process is extremely important, because working with good, clean data is vital to producing a good visualization.

Want to learn more about what "clean" data means? Come to part 2 of this workshop!

#### 2.X—Challenge

Sometimes data comes with imperfections. Here we learned the tools to "fix" those imperfections, but the data we're using doesn't have very many.

There is one thing though--the column X33 contains a lot of NA values. What if we wanted to filter those out?
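
If you'd like a nudge, here is one possible approach sketched on a made-up table. The table `measurements` and its columns are hypothetical. It assumes two helpers that the workshop has not covered: base R's `is.na`, which is TRUE for missing values, and `ends_with`, which picks columns by the end of their name.

```r
library(tidyverse)

# A made-up table with one missing value (hypothetical data, for illustration only)
measurements <- tibble(
  label      = c("a", "b", "c"),
  size_worst = c(1.5, NA, 3.0),
  size_mean  = c(1.0, 2.0, 2.5)
)

# is.na() is TRUE for missing values, and "!" flips TRUE and FALSE,
# so this keeps only the rows where size_worst is present
complete_rows <- filter(measurements, !is.na(size_worst))

# ends_with() selects every column whose name ends in the given text
worst_only <- select(complete_rows, label, ends_with("_worst"))

worst_only
```

Before applying the same idea to the cancer data, it is worth checking how many values in a column are actually missing, since filtering on that column removes every row where it is NA.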

In [None]:
## Your turn: produce a table that has all of the NAs removed, and contains all of the _worst columns, 
## including the diagnosis

### 3—Your first visualization 📊
⏱ **30 mins**

#### 3.1—ggplot

ggplot is a really fancy R function used to create lots of different kinds of visualizations, or "plots."

It is super powerful, and you can create *tons* of different charts with it. The only way to master ggplot is to experiment and try many different things. Here, I'll try to give you a taste of what's possible!

In [None]:
## Here's a basic scatterplot of the data we've gathered

basic_plot <- ggplot(cancer_select, aes(x = radius_mean, y = smoothness_mean)) +
    geom_point()
basic_plot

ggplot looks scary, so let's try to break it down. First, recognize that ggplot is a function that takes a series of arguments. The first argument is your data set. The second is "aes", short for "aesthetic mappings". For our purposes, this is just a function that accepts the x and y columns we'd like to use.
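
The same pieces can be seen on a tiny made-up table. This is only a sketch: `toy_points`, `height`, and `width` are hypothetical names chosen for illustration.

```r
library(tidyverse)

# A made-up table with two numeric columns (hypothetical data)
toy_points <- tibble(
  height = c(1, 2, 3, 4),
  width  = c(2, 4, 5, 8)
)

# First argument: the data set.
# Second argument: aes(), naming the columns to use for x and y.
toy_plot <- ggplot(toy_points, aes(x = height, y = width)) +
  geom_point()  # draw one point per row

toy_plot
```

Swapping in a different table or different column names inside `aes` is all it takes to plot something else.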

What is the "+" symbol at the end doing? We'll get to that in the next section.

In [None]:
## Your turn: create a scatterplot with perimeter_mean along the x axis and area_mean along the y axis.

## Hint: copy and paste is your friend.

### BEGIN SOLUTION
### END SOLUTION

In [None]:
## Your turn: using your solution from the last cell, change "geom_point()" to "geom_violin()". What happens?

### BEGIN SOLUTION
### END SOLUTION

## Psst... see a complete list of different geom_...()s you can put here: https://ggplot2.tidyverse.org/reference/

#### 3.2—Adding layers

We can add layers to our plot to supply additional graphics; this is also useful for adding additional data, such as a single point, to our graph. Here's a rather involved example.

In [None]:
ggplot(cancer_select, aes(x = radius_mean, 
                          y = smoothness_mean)) +
    geom_point() + 
    xlab("Radius mean") +
    ylab("Smoothness mean") + 
    geom_smooth(method=lm,   # Add linear regression line
                se=FALSE) +    # Don't add shaded confidence region
    theme(text = element_text(size = 30))

Notice the "+" symbol at the end of every line? This is something special we do with ggplot, usually referred to as "adding layers". We can make our visualization more complex by adding one "layer" at a time. A layer typically refers to something like geom_point()—which adds a point layer to our graph—or geom_smooth() or the countless other layers we can add, but you can also use "+" to add things like a theme to your plot, a legend, labels for your x- and y- axes, and more. The possibilities are endless!
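
Because a plot can be saved to a variable, you can also keep adding to it later. Here is a small sketch of that idea on made-up data; `toy_points`, `base_plot`, and `labelled_plot` are hypothetical names.

```r
library(tidyverse)

# A made-up table with two numeric columns (hypothetical data)
toy_points <- tibble(
  height = c(1, 2, 3, 4),
  width  = c(2, 4, 5, 8)
)

# Start with a plain scatterplot and save it
base_plot <- ggplot(toy_points, aes(x = height, y = width)) +
  geom_point()

# Add axis labels and a larger font on top of the saved plot
labelled_plot <- base_plot +
  xlab("Height") +
  ylab("Width") +
  theme(text = element_text(size = 20))

labelled_plot
```

Saving the result under a new name leaves `base_plot` available for trying out other combinations.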

In [None]:
## Your turn: make the font size of basic_plot 60 by adding a layer. Save it to a variable called big_text_plot.

# Hint: use the example above.

### BEGIN SOLUTION ###
### END SOLUTION ###

#### 3.3—Getting fancy

Scatterplots are cool and all, but what else can we do? What if we wanted to group the points? How can we create a more effective visualization?

One thing that might be natural with data like this, where we have a binary grouping (benign or malignant tumors), is to color the data by group. How do we do that?

With aes! Here's an example.

In [None]:
# Change plot size
options(repr.plot.width = 10, repr.plot.height = 10, repr.plot.res = 100)

ggplot(cancer_select, aes(x = radius_mean, 
                          y = smoothness_mean, 
                          color = diagnosis)) +
    geom_point() + 
    xlab("Radius mean") +
    ylab("Smoothness mean") + 
    theme(text = element_text(size = 30))
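
By default, ggplot picks the colors for you. If you want to choose them yourself, one option is ggplot2's `scale_color_manual`, which the example above does not use. The sketch below tries it on made-up data: `toy_points`, its columns, and the group values "a" and "b" are all hypothetical.

```r
library(tidyverse)

# A made-up table with a grouping column (hypothetical data)
toy_points <- tibble(
  height = c(1, 2, 3, 4),
  width  = c(2, 4, 5, 8),
  group  = c("a", "b", "a", "b")
)

colored_plot <- ggplot(toy_points, aes(x = height, y = width, color = group)) +
  geom_point() +
  # Choose the color used for each value in the group column
  scale_color_manual(values = c("a" = "purple", "b" = "darkgreen")) +
  # Add a title above the plot
  ggtitle("Toy points by group")

colored_plot
```

The names on the left of each `=` inside `values` must match the values that appear in the column you mapped to `color`.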

In [None]:
## Your turn: create a scatterplot with perimeter_mean along the x axis and area_mean along the y axis. The plot should
## have benign tumors colored red, and malignant tumors colored orange. The x- and y- axes should be labelled.

## Bonus: add a title to the plot (don't be afraid to use Google!)

### BEGIN SOLUTION ###
### END SOLUTION ###

In [None]:
## Your turn: create some kind of other plot that is not a violin, and not a scatterplot, 
## using the skills you learned today.

### BEGIN SOLUTION ###
### END SOLUTION ###

#### 3.X—Challenge

In [None]:
# Your turn: Create a 3D scatterplot with a line of best fit. That is, use three predictors 
# (the third one is your choice, two should be radius and smoothness). The plot should still be colored.

### BEGIN SOLUTION ###
### END SOLUTION ###

##### PART II — 1 hour

Attend part 2 of this workshop series to see how you can extend what we've just done to predict whether a tumor is benign or malignant!

Thank you so much for attending part 1 ✨ I hope you enjoyed!