# Hands-on with R + ggplot2

# 1. Improving Pie Charts

*What is wrong with this figure?* 

![](https://drive.google.com/uc?id=1K6hCHovjZV5Icbn3zd-gW86RSRjiH0i-)

## Let's agree that this is a monstrosity.  Now, how do we improve it?

In [None]:
# import the necessary library
library(ggplot2)

## 1.1. Read in the data

*This is a made-up data set from a colleague of mine. We have 10 items, each with a text label and a numeric value.*

*I'm using `read.csv` to read in the data.*

In [None]:
url <- 'https://drive.google.com/file/d/1iWAtKk7aOinwb-pJ-Cy-hiB5xBv3Z-b5/view?usp=sharing'

# Extract file ID from the URL
file_id <- strsplit(url, "/")[[1]][6]

# Construct the direct download link
direct_url <- paste0('https://drive.google.com/uc?id=', file_id)

data <- read.csv(direct_url)
data
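
*The cell above pulls the file ID out by its position in the URL. As a slightly more defensive sketch, the same idea can be wrapped in a small helper that looks for the piece after `/d/` instead. The function name `drive_direct_url` is my own invention for illustration, and it assumes the share link has the same `/file/d/<id>/...` shape as the one above.*

```r
# Hypothetical helper: turn a Google Drive "share" link into a direct-download link.
drive_direct_url <- function(share_url) {
    # capture whatever sits between "/d/" and the next "/"
    file_id <- sub(".*/d/([^/]+)/.*", "\\1", share_url)
    # if the pattern did not match, sub() returns the input unchanged
    if (identical(file_id, share_url)) {
        stop("Could not find a file ID in: ", share_url)
    }
    paste0("https://drive.google.com/uc?id=", file_id)
}

drive_direct_url("https://drive.google.com/file/d/1iWAtKk7aOinwb-pJ-Cy-hiB5xBv3Z-b5/view?usp=sharing")
```

*The result can be passed straight to `read.csv`, just like `direct_url` above.*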

## 1.2. For many use cases (including this one), a bar chart is a better option than a pie chart.

*Humans can more easily interpret differences in bar charts. Pie charts require us to interpret areas = slow, while bar charts use position = fast. Generally, you should choose a bar chart over a pie chart when:*
- *There are too many categories to easily distinguish between pie chart areas (as we have here).*
- *Slice sizes in the pie chart are too similar (as we have here).*
- *You have multiple data sets (which we do not have here).*
- *The raw percentages can provide as much (or more) meaning than the fraction of a whole (as we have here).*

*Pie charts are only useful when there are few categories, each category has a very different percentage, AND the purpose of your visualization is to show fractions of a whole.*
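
*To see the area vs. position argument for yourself, here is a minimal sketch that draws the same values both ways. The `toy` data frame is made up for illustration (it is not the data set we read in above), with values deliberately close to each other.*

```r
library(ggplot2)

# Made-up toy data for illustration only (not the data set used in this notebook)
toy <- data.frame(
    Label = c("A", "B", "C", "D", "E"),
    Value = c(22, 21, 20, 19, 18)
)

# A pie chart in ggplot2 is a single stacked bar drawn in polar coordinates
pie <- ggplot(toy, aes(x = "", y = Value, fill = Label)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar(theta = "y") +
    theme_void()

# The same values as a bar chart, where differences are read from position
bar <- ggplot(toy, aes(x = Label, y = Value)) +
    geom_bar(stat = "identity")

print(pie)
print(bar)
```

*With slices this similar, ranking them from the pie alone is hard, while the bar heights make the order easier to read.*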

*Here is the default bar chart from ggplot2. It leaves a lot to be desired...*

In [None]:
ggplot(data, aes(x = Label, y = Value)) +
    geom_bar(stat = "identity") # use stat = "identity" because we are supplying the actual bar values
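
*As an aside, ggplot2 also provides `geom_col()`, which uses the supplied y values as bar heights without needing `stat = "identity"`. Here is a small sketch on made-up data (the `toy` data frame is only for illustration).*

```r
library(ggplot2)

# Made-up toy data for illustration only
toy <- data.frame(Label = c("A", "B", "C"), Value = c(3, 1, 2))

# geom_col() uses the y values as bar heights, like geom_bar(stat = "identity")
p_col <- ggplot(toy, aes(x = Label, y = Value)) +
    geom_col()

print(p_col)
```

*I'll keep using `geom_bar(stat = "identity")` in the cells below, but either form should give the same chart.*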

## 1.3. Improve the axis labels and add a plot title

*The text for the bars is unreadable.  How should we fix that?*

In [None]:
ggplot(data, aes(x = Label, y = Value)) +
    geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage", x = "", y = "Percent")

## 1.4. Fix the bar text, sort the data, add the percentage values to each bar

In [None]:
ggplot(data, aes(x = reorder(Label, Value), y = Value)) +
    geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage", x = "", y = "") + 
    coord_flip() + # this flips the plot to horizontal
    geom_text(aes(label = paste0(Value,"%")), vjust = 0, hjust = -0.2) + # add labels
    ylim(0,11) # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
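
*The sorting in the plot above comes from `reorder(Label, Value)`. Here is a small sketch, on made-up vectors, of what `reorder` actually returns.*

```r
# Made-up toy data for illustration only
labels <- c("A", "B", "C")
values <- c(3, 1, 2)

# reorder() returns a factor whose levels are sorted by the second argument
sorted <- reorder(labels, values)
levels(sorted)

# negate the values to sort in the opposite direction
levels(reorder(labels, -values))
```

*The first call gives the levels in the order B, C, A (smallest value first), and ggplot2 draws the bars in level order. After `coord_flip()` the first level sits at the bottom, so the largest bar ends up at the top.*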

## 1.5. Clean this up a bit
- *I don't want the grid lines anymore*
- *We can remove the axes entirely*
- *Make the font larger*
- *Let's change the colors, and highlight one of them*
- *Save the plot*

In [None]:
# Make plot wider for display
options(repr.plot.width = 15, repr.plot.height = 8)

ggplot(data, 
        aes(
            x = reorder(Label, Value),
            y = Value,
            fill = factor(ifelse(Label == "Color Choice", "Highlighted", "Normal")) # to highlight one bar
        )
    ) + 
    geom_bar(stat = "identity", show.legend = FALSE) + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage in Data Visualization", x = "", y = "") + 
    coord_flip() + # this flips the plot to horizontal
    geom_text(aes(label = paste0(Value,"%")), vjust = 0, hjust = -0.2, size = 6) + # add labels
    ylim(0,11) + # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
    scale_fill_manual(name = "", values = c("orange","grey50")) + # set the colors for highlighting
    theme_classic() + # there are many themes to choose from : https://ggplot2.tidyverse.org/reference/ggtheme.html
    theme(
        axis.line = element_blank(), # remove the remaining axis lines
        axis.text.x = element_blank(), # remove x axis labels
        axis.ticks.x = element_blank(), # remove x axis ticks
        axis.ticks.y = element_blank(),  # remove y axis ticks
        axis.text = element_text(size = 20), # increase the font size of the labels
        plot.title = element_text(size = 30) # increase the font size of the title
    )

# save the figure (have to specify size here again)
# ggsave("bar_r.pdf", device = "pdf", width = 15, height = 8)
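
*If you prefer to save a specific plot object (rather than whatever was displayed last), you can pass it to `ggsave` explicitly. This is a minimal sketch on made-up data that writes to a temporary file; the `toy` data frame and the file name `bar_example.pdf` are assumptions for illustration.*

```r
library(ggplot2)

# Made-up toy data for illustration only
toy <- data.frame(Label = c("A", "B", "C"), Value = c(3, 1, 2))

p <- ggplot(toy, aes(x = reorder(Label, Value), y = Value)) +
    geom_bar(stat = "identity") +
    coord_flip()

# Write to a temporary file here; swap in a real path such as "bar_r.pdf"
out_file <- file.path(tempdir(), "bar_example.pdf")

# Passing `plot = p` saves this specific object rather than the last plot displayed;
# width and height are in inches by default
ggsave(out_file, plot = p, device = "pdf", width = 15, height = 8)

file.exists(out_file)
```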

# 2. Scatter Plots

In [None]:
# import the necessary library
library(ggplot2)

## 2.1. Read in the data

*I downloaded [2024 Chicago taxi data](https://data.cityofchicago.org/Transportation/Taxi-Trips-2024-/ajtu-isnz/about_data) from the [Chicago data portal](https://data.cityofchicago.org/).  This dataset has millions of rows and many columns (about 1.3 GB), so it may take some time to load and visualize.*

*If you want to run this code locally, please either download the data from the Chicago Data Portal linked above, or the version that I have on Google Drive [here](https://drive.google.com/file/d/1QPS8DY2bDCbttMf4dEIIC3LOdYlph7sJ/view?usp=sharing).  (The dataset is too large to host on GitHub.)*  

*Here, we will look at columns for `Fare` and `Tips`.*

In [None]:
# this assumes that you have downloaded the data (as above), and placed it in a data directory with the file name 'Taxi_Trips__2024-__20240731.csv'
df <- read.csv('data/Taxi_Trips__2024-__20240731.csv')
head(df)
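
*Since we only need `Fare` and `Tips`, one option (not what I do above) is to have `read.csv` skip the other columns via `colClasses`. Here is a sketch using a tiny stand-in file written to a temporary location; the extra column names in it are made up, and I have not timed this on the full taxi file.*

```r
# Stand-in file for illustration only; the column names other than Fare and Tips are made up
tmp_csv <- tempfile(fileext = ".csv")
write.csv(
    data.frame(Trip.ID = c("a", "b", "c"), Fare = c(10, 20, 30), Tips = c(2, 5, 0), Company = c("x", "y", "z")),
    tmp_csv, row.names = FALSE
)

# Read only the header row to learn the column names
header <- names(read.csv(tmp_csv, nrows = 1))

# "NULL" tells read.csv to skip a column; NA lets it guess the type
col_classes <- ifelse(header %in% c("Fare", "Tips"), NA, "NULL")

small <- read.csv(tmp_csv, colClasses = col_classes)
small
```

*To try this on the real data, replace `tmp_csv` with the path to the downloaded file.*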

## 2.2. Let's plot the `Fare` vs. `Tips` data as a scatter plot.

*Is there anything that we should improve upon here?*

In [None]:
# create the scatter plot
ggplot(data = df, aes(x = Fare, y = Tips)) +
    geom_point() 

## 2.3. Let's improve this
- *Change the axis range.*
- *Try open circles as symbols.*
- *Add a title and some descriptive labels to the axes.*
- *Increase the font sizes.*

In [None]:
ggplot(data = df, aes(x = Fare, y = Tips)) +
    geom_point(shape = 1, size = 2) + 
    labs(
            title = "How Chicagoans Tipped their Cab Drivers in 2024", 
            x = "Fare ($)",
            y = "Tip ($)"
    ) +
    xlim(0,150) + ylim(0,150) +
    theme(
        panel.grid.major = element_blank(),  # remove the grid
        panel.grid.minor = element_blank(), # remove the grid
        axis.title = element_text(size = 20), # increase the font size of the axis titles
        plot.title = element_text(size = 30), # increase the font size of the title
        axis.text = element_text(size = 16), # increase the font size of the tick labels
        aspect.ratio = 1 # so it's not as wide as the default
    )

# save the figure (have to specify size here)
# ggsave("scatter_r.pdf", device = "pdf", width = 8, height = 5)
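
*One thing to be aware of when changing the axis range: `xlim()` and `ylim()` set the scale limits, so points outside the range are dropped (ggplot2 warns about this), while `coord_cartesian()` only zooms the view. Here is a small sketch of both on made-up data (the `toy` data frame is for illustration only).*

```r
library(ggplot2)

# Made-up toy data for illustration only; one ride falls outside the 0-150 range
toy <- data.frame(Fare = c(10, 40, 90, 300), Tips = c(2, 8, 20, 60))

base <- ggplot(toy, aes(x = Fare, y = Tips)) +
    geom_point(shape = 1, size = 2)

# xlim()/ylim() set the scale limits: out-of-range points are dropped (with a warning)
p_limits <- base + xlim(0, 150) + ylim(0, 150)

# coord_cartesian() zooms the view without discarding any data
p_zoom <- base + coord_cartesian(xlim = c(0, 150), ylim = c(0, 150))

print(p_limits)
print(p_zoom)
```

*For a plain scatter plot the two look the same. The difference matters once you add layers that compute something from the data (for example, bins or a smoother).*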

## 2.4. Can we improve this more?
- *Use a 2d histogram instead.  (Often when you have so much overlapping data, it is easier for the viewer if you switch to a 2d histogram, contour plot, or similar.)*
- *Include a colorbar.*
- *Add lines at typical tip rates and label them?*

In [None]:
library(scales)  # For the log transformation and scales::squish
library(dplyr) 

In [None]:
# explicitly set the plot size for Jupyter display 
options(jupyter.plot_mimetypes = "image/png", repr.plot.width = 9, repr.plot.height = 5, repr.plot.res = 300)

# Create the plot
p <- ggplot(df, aes(x = Fare, y = Tips)) +
    geom_bin2d(bins = 60) +
    scale_fill_continuous(
        trans = 'log', 
        low = "white", high = "darkblue",
        limits = c(1, 1e4), # Set the minimum and maximum values for the colormap
        oob = scales::squish, # Squish out-of-bounds values to the nearest limit (e.g., counts above 1e4 get the darkest color)
        breaks = c(1, 10, 100, 1e3, 1e4),  # Define the legend values to show
        guide = guide_colourbar(
            title = "Number of Rides",
            title.theme = element_text(size = 20),   # Title font size
            label.theme = element_text(size = 16)    # Label font size
        )
    ) +
    labs(
        title = "How Chicagoans Tipped\ntheir Cab Drivers in 2024", 
        x = "Fare ($)",
        y = "Tip ($)"
    ) +
    scale_x_continuous(limits = c(0, 150), expand = c(0, 0)) +
    scale_y_continuous(limits = c(0, 60), expand = c(0, 0)) +
    theme_bw() + # remove the gray background
    theme(
        panel.grid.major = element_blank(),  # remove the grid
        panel.grid.minor = element_blank(), # remove the grid
        axis.title = element_text(size = 20), # increase the font size of the axis titles
        plot.title = element_text(size = 30), # increase the font size of the title
        axis.text = element_text(size = 16), # increase the font size of the tick labels
        aspect.ratio = 1, # so it's not as wide as the default
        legend.key.height = unit(1.5, "cm")
    ) 


# add lines at standard tip rates (uncomment below to include the lines in the plot)
# tip_pcts <- c(0.2, 0.25, 0.30, 0.4)
# p <- p + 
#   geom_abline(data = data.frame(slope = tip_pcts), aes(intercept = 0, slope = slope), color = 'black', linetype = 'dashed', alpha = 0.7) +
#   geom_text(data = data.frame(pct = tip_pcts), aes(x = 120, y = pct*130, label = paste0(pct*100, "%"), angle = 100*pct), color = 'black', hjust = 0, alpha = 0.7)


# show the plot
print(p)
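
*The optional dashed lines above work because a constant tip rate `r` is the straight line `Tips = r * Fare`, so the slope is the rate and the intercept is zero. As a quick sketch of the same arithmetic, here is how you could compute tip rates directly. The `toy` data frame is made up for illustration; for the real data you would use `df` instead.*

```r
# Made-up toy data for illustration only
toy <- data.frame(Fare = c(10, 20, 40, 0), Tips = c(2, 5, 10, 1))

# Skip zero fares to avoid dividing by zero
valid <- toy[toy$Fare > 0, ]

# Tip rate as a fraction of the fare; a constant rate r is the line Tips = r * Fare
valid$tip_rate <- valid$Tips / valid$Fare

median(valid$tip_rate)
```

*For this toy data the rates are 0.2, 0.25, and 0.25, so the median is 0.25 (a 25% tip).*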