

# ggplot Deep dive ![ggplot.png](attachment:dcce5394-5c41-4889-a1c8-9da28ffad236.png)

In this notebook I will take a deep dive into the syntax of ggplot. While there are many resources for learning the ins and outs of ggplot, they are scattered all over the web. The goal of this notebook is to synthesize and add to these resources in a single repository.
<br>
<br>
Throughout this notebook we will build ggplot graphs iteratively, progressively creating more complex and more complete data visualizations.
<br>
<br>
The [data](https://www.kaggle.com/spscientist/students-performance-in-exams) in this notebook consists of students' grades in maths, reading and writing. In addition, it contains information about students' gender and race/ethnicity, as well as about students' parents' level of education, among others.
<br>
<br>
Let's dive in!

* [Installing ggplot2](#Installing_ggplot2)
* [Understanding the basics of ggsyntax](#basics_ggsyntax)
* [Graph customization](#graph_customization)
    - [Colors](#colors)
    - [Legends](#legends)
    - [X & Y Axes](#axes)


<a id="Installing_ggplot2"></a>
## 1. Installing ggplot2

ggplot is part of the [tidyverse](https://www.tidyverse.org/), a collection of packages that share common principles and are designed to work together seamlessly. These packages are useful for a myriad of tasks, from data wrangling to data analysis.
<br>
ggplot can be installed in two ways:

install.packages("ggplot2")
<br>
library(ggplot2)
<br>
<br>
or
<br>
<br>
install.packages("tidyverse")
<br>
library(tidyverse)
<br>
The second option installs not only ggplot2 but also all the other packages in the tidyverse.
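
If ggplot2 may already be installed, for instance in a hosted notebook environment, reinstalling it on every run is unnecessary. A minimal sketch that installs the package only when it is missing, using base R's requireNamespace():

```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```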


In [None]:
# Install ggplot2
install.packages("ggplot2") #install package
library(ggplot2) #library call

<a id="basics_ggsyntax"></a>
## 2. Understanding the basics of ggsyntax

According to the [tidyverse documentation](https://ggplot2.tidyverse.org/reference/ggplot.html), the function ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
<br>
<br>
ggplot() is used to construct the initial plot object, and is almost always followed by + to add components to the plot.
<br>
<br>
Now, let's dissect a ggplot. In the line of code below, the first argument of ggplot() is the data frame to be used. This is followed by the aes() function, which is used to construct the aesthetic mappings. In layman's terms, this is where we'll have:
* The variables for our x and y axes
* Other aesthetics such as color and fill, among others (we'll see these later)

In [None]:
# General template (placeholders, not runnable as written):
# ggplot(dataframe, aes(x = x_variable, y = y_variable, <other aesthetics>))
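
To make the template concrete, here is a small sketch that fills it in with R's built-in mtcars data set. That data set is an assumption for illustration only; the student data used in this notebook is loaded further down. It also shows that ggplot() on its own creates a plot object with no layers, and that + adds one.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the student data here.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))  # data + aesthetic mappings, no geoms yet
scatter   <- base_plot + geom_point()              # `+` adds a layer to the plot object

length(base_plot$layers)  # number of layers before adding a geom
length(scatter$layers)    # number of layers after adding geom_point()
```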

Normally, ggplot objects will have several layers in which we can add extra functions. **Geometry components** are the most common layers of ggplot. They determine the kind of plot to be generated, for instance:
<br>
* geom_bar() --> bar plot
* geom_point() --> scatter plot
* geom_smooth() --> add a smoothed trend line to a scatter plot (a straight regression line with method = "lm")
* geom_boxplot() --> boxplot
* geom_line() --> line graph
* geom_histogram() --> histogram

among many others.
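
Geoms can also be combined in a single plot. The sketch below layers a scatter plot and a straight regression line, again using the built-in mtcars data as a stand-in (an assumption, not the notebook's data). layer_data() is a ggplot2 helper that returns the data each layer actually draws.

```r
library(ggplot2)

# mtcars (built into R) is an illustrative stand-in for the notebook's data.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                           # layer 1: scatter plot
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # layer 2: straight regression line

points_data <- layer_data(p, 1)  # data drawn by geom_point(): one row per car
smooth_data <- layer_data(p, 2)  # fitted values computed by geom_smooth()
```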

We add these "geom" functions to graphs using the *+* sign.
For example, the bar graph below shows that, of the 1000 students tested, more than half were female.

In [None]:
#Import data
data <- read.csv("../input/students-performance-in-exams/StudentsPerformance.csv")
names(data)[2] <- "race"
names(data)[3] <- "education"
names(data)[5] <- "preparation"
names(data)[6] <- "math"
names(data)[7] <- "reading"
names(data)[8] <- "writing"

In [None]:
# Generate plot
ggplot(data, aes(x=gender))+
geom_bar()

Now, this plot seems pretty dull, so let's add some color to it! For this we will use the optional "color" and "fill" arguments of the aes() function. As you can see, I am using the variable *gender* as both fill and color. This means that the colors displayed by the graph will correspond to the two levels of the variable *gender*, namely *male* and *female*.
<br>
<br>
Just to be specific: *color* affects the **outline** of the bars, while *fill* affects **the color the bars are filled with**...yes, a bit counterintuitive.

In [None]:
# Generate plot
# The salmon and teal shown here are ggplot2's default colors for two groups
ggplot(data, aes(x= gender,color=gender,fill= gender))+ # plot gender with distinct colors
geom_bar() # make it a bar graph


However, there are other ways to color plots. Since geoms are functions, each of them has its own arguments. For instance, one of the many arguments of [geom_bar](https://ggplot2.tidyverse.org/reference/geom_bar.html) is "fill", which sets the color of the bars, as can be seen below. Importantly, the fill set in geom_bar() overrides the fill mapped in aes(): every bar is painted *steelblue*, replacing the colors that would have differentiated the two genders. In this case, which fill do you think would be more informative?

**In all, the most important lesson to take away from here is the following: everything you write *inside* aes() maps an aesthetic to a variable in the data, while everything *outside* aes() sets a fixed appearance parameter that simply makes the plot pretty.**

In [None]:
ggplot(data, aes(x= gender, fill= gender))+ # plot gender by color
geom_bar(fill= "steelblue") # override, make it a steel blue bar graph
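
The difference between mapping inside aes() and setting outside it can be checked without looking at the picture. The sketch below uses a tiny made-up data frame (not the student data) and ggplot2's layer_data() to inspect the fill each bar receives.

```r
library(ggplot2)

# Toy data, made up for illustration.
toy <- data.frame(gender = c("female", "female", "male"))

mapped <- ggplot(toy, aes(x = gender, fill = gender)) + geom_bar()     # fill depends on the data
fixed  <- ggplot(toy, aes(x = gender)) + geom_bar(fill = "steelblue")  # fill is a constant

unique(layer_data(mapped)$fill)  # one color per level of gender
unique(layer_data(fixed)$fill)   # a single color for every bar
```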

<a id="graph_customization"></a>
## 3. Graph customization

In this section of the notebook, I will use bar plots, given that they are some of the most common graphs across disciplines, to illustrate many of the modifications that can be made to plots. In later sections, I will explain how to make other types of plots as well but, for now, let's stick to bar plots.

Remember, we will be building our plots iteratively, that is, adding layer after layer little by little. I suggest you do this because, if you run into problems, you will know exactly where you messed up, since you're building plots step by step.

In all we will cover the following aspects of graph customization:
* Colors
* Legends
* Axes
* Text
* Themes
* Faceting
* Saving



<a id="colors"></a>
### Colors
Coloring plots is almost an art form, which is why there is so much information out there on colors and palettes for ggplot2. To see some *excellent* ones, go [here](https://www.r-graph-gallery.com/ggplot2-color.html) or [here](http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/).
<br>
<br>
There are four basic ways to select a color in ggplot, just take your pick:
* Using the name of the color ("red")
* Using rgb() to build your own color
* Using the number of the color
* Using the hexadecimal code for the color
<br>
<br>
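
As a quick illustration of the four options above, the sketch below writes the same red in each form using base R only. Reading "the number of the color" as its position in the vector returned by colors() is my assumption.

```r
by_name   <- "red"                                # a built-in color name
by_rgb    <- rgb(255, 0, 0, maxColorValue = 255)  # built from red/green/blue components
by_number <- colors()[match("red", colors())]     # looked up by its position in colors()
by_hex    <- "#FF0000"                            # hexadecimal code

# col2rgb() converts each specification to its RGB components for comparison.
col2rgb(c(by_name, by_rgb, by_number, by_hex))
```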
There are also full *palettes* for ggplot. One of the most popular ones is the [Brewer palette](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3), though there are many others.
<br>
<br>
Now let's put colors into practice. Here are some examples:

In [None]:
# Plotting math score by race group, with bars colored by gender
ggplot(data, aes(x= race, y=math, fill=gender))+ # must include fill in aes()
geom_bar(stat= "identity")+ 
scale_fill_manual(values=c("#CC6666", "#66CC99"))+ # color with hexadecimal codes
ggtitle("Graph 1")

# Note: geom_bar() stacks the bars by default (position = "stack"), so each bar is split by gender

In [None]:
# First install the package
install.packages("RColorBrewer")
library(RColorBrewer)

# Plotting race group by math score colored by race group
ggplot(data, aes(x= race, y=math, fill=race))+  # must include fill in aes()
geom_bar(stat= "identity")+
scale_fill_brewer(palette="Set1")+ # color with palette
ggtitle("Graph 2")

In all, we have seen that there are several ways in which we can color ggplot graphs, both from inside and outside the aes() function, depending on what we want to convey. Importantly, in the examples seen here we have discovered the *scale_fill_manual* function which, as its name indicates, allows us to manually select the fill of our plots.
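
One optional refinement: scale_fill_manual() also accepts a *named* vector of values, which ties each color to a specific level instead of relying on the order of the levels. A minimal sketch with made-up data (not the student data):

```r
library(ggplot2)

toy <- data.frame(gender = c("female", "male", "male"))  # made-up data

p <- ggplot(toy, aes(x = gender, fill = gender)) +
  geom_bar() +
  # Names tie each color to a level, so the order of the vector no longer matters.
  scale_fill_manual(values = c(male = "#66CC99", female = "#CC6666"))

bars <- layer_data(p)  # one row per bar, with the fill each bar receives
bars$fill
```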

<a id="legends"></a>
### Legends
The legend of a graph explains how aesthetics other than the axes, such as fill or color, correspond to values in the data. We have already seen examples of legends in the bar plots above (gender in Graph 1 and race in Graph 2). There will be many times when we will want to modify the legends in our graphs to make them more informative, and there might even be times when we want to delete the legend because it is repetitive (e.g., see Graph 2 above).
<br>
<br>
First, let's learn how to change the position of a legend or to remove it. Here, you will learn about the family of [theme](https://ggplot2.tidyverse.org/reference/ggtheme.html) functions, which contain numerous parameters to modify ggplot objects.

In [None]:
# In the case of this graph (graph 2) we know that the legend is redundant because it reflects the same information
# as the x-axis. Therefore, we remove it.
ggplot(data, aes(x= race, y=math, fill=race))+  # must include fill in aes()
geom_bar(stat= "identity")+
scale_fill_brewer(palette="Set1")+ # color with palette
theme(legend.position="none")+ # remove the legend
ggtitle("Graph 3")

# Other options
# theme(legend.position="top")
# theme(legend.position="bottom")
# theme(legend.position="left")
# theme(legend.position="right")

Now, we are going to modify the names of items in a legend:

In [None]:
# Improve name of legend and capitalize educational levels
ggplot(data, aes(x=education, y=math, fill=education))+
geom_bar(stat= "identity")+
scale_fill_discrete(name = "Parents' educational level", # Rename title of legend
                  breaks=c("associate's degree", "bachelor's degree", "high school", #set breaks in legend
                          "master's degree", "some college", "some high school"),
                  labels= c("Associate's degree", "Bachelor's degree", "High school", # set labels of legend's levels
                          "Master's degree", "Some college", "Some high school"))+
ggtitle("Graph 4")
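
Typing every label by hand works, but the labels argument of a scale also accepts a function that is applied to the default labels. The sketch below uses a hypothetical helper, capitalize_first(), which is not part of ggplot2, together with made-up data containing two of the education levels.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): upper-case the first letter of each label.
capitalize_first <- function(x) {
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}

capitalize_first(c("some college", "high school"))

# Made-up data with two of the education levels.
toy <- data.frame(education = c("some college", "high school"), math = c(70, 65))

p <- ggplot(toy, aes(x = education, y = math, fill = education)) +
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Parents' educational level",
                      labels = capitalize_first)  # the function relabels the legend levels
# Run `p` on its own line to draw the plot.
```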



Finally, we will learn how to change the appearance of the legend:

In [None]:
# Change the appearance of the legend (text, background and keys)
ggplot(data, aes(x=education, y=math, fill=education))+
geom_bar(stat= "identity")+
scale_fill_discrete(name = "Parents' educational level", # rename title of legend
                  breaks=c("associate's degree", "bachelor's degree", "high school", #set breaks in legend
                          "master's degree", "some college", "some high school"),
                  labels= c("Associate's degree", "Bachelor's degree", "High school", # set labels of legend levels
                          "Master's degree", "Some college", "Some high school"))+
theme(legend.title = element_text(color = "blue", size = 10), # change color and size of legend title
          legend.text = element_text(color = "red")) + # change color of legend levels
theme(legend.background = element_rect(fill = "lightgray"), # change background color of the legend
  legend.key.size = unit(1.5, "cm"), # change size of legend keys
  legend.key.width = unit(0.5,"cm"))+ # change width of legend keys
ggtitle("Graph 5")

<a id="axes"></a>
### X & Y Axes
Up through Graph 5, we have changed and improved our plots considerably. We will now learn something crucial: how to modify the x and y axes of our graphs so as to make them as informative and legible as possible.
<br>
<br>
We will start with Graph 5, whose x axis we can barely read because the labels of the different levels are all squished together. What can we do to make this better? Changing or shortening the names of the levels is not an option here, so we'll change the *angle* of the text.

In [None]:
ggplot(data, aes(x=education, y=math, fill=education))+
geom_bar(stat= "identity")+
scale_fill_discrete(name = "Parents' educational level", # rename title of legend
                  breaks=c("associate's degree", "bachelor's degree", "high school", #set breaks in legend
                          "master's degree", "some college", "some high school"),
                  labels= c("Associate's degree", "Bachelor's degree", "High school", # set labels of legend levels
                          "Master's degree", "Some college", "Some high school")) +
theme(axis.text.x  = element_text(angle=45, vjust=0.5))+
ggtitle("Graph 6")

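Next, we will rename the x axis and capitalize its levels using scale_x_discrete(), which takes the same name, breaks and labels arguments we used for the legend.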

In [None]:
# Note: the legend and the x axis now show the same information; the redundancy is kept for educational purposes :)
ggplot(data, aes(x=education, y=math, fill=education))+
geom_bar(stat= "identity")+
scale_fill_discrete(name = "Parents' educational level", # rename title of legend (i.e., the fill = education)
                  breaks=c("associate's degree", "bachelor's degree", "high school", #set breaks in legend
                          "master's degree", "some college", "some high school"),
                  labels= c("Associate's degree", "Bachelor's degree", "High school", # set labels of legend levels
                          "Master's degree", "Some college", "Some high school")) +
scale_x_discrete(name = "Parents' educational level", # rename x axis (i.e., education as well)
                  breaks=c("associate's degree", "bachelor's degree", "high school", # levels of x axis
                          "master's degree", "some college", "some high school"),
                  labels= c("Associate's degree", "Bachelor's degree", "High school", #rename levels of x axis
                          "Master's degree", "Some college", "Some high school"))+
theme(axis.text.x  = element_text(angle=45, vjust=0.5)) +
ggtitle("Graph 7")

Let's modify the y axis now. We see that it shows only 4 breaks, so we will add more to give the scale finer detail.

In [None]:
ggplot(data, aes(x=education, y=math, fill=education))+
geom_bar(stat= "identity")+
scale_x_discrete(name = "Parents' educational level", # rename x axis (i.e., education as well)
                  breaks=c("associate's degree", "bachelor's degree", "high school", # levels of x axis
                          "master's degree", "some college", "some high school"),
                  labels= c("Associate's degree", "Bachelor's degree", "High school", #rename levels of x axis
                          "Master's degree", "Some college", "Some high school"))+
theme(legend.position = "none") + # no legend
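scale_y_continuous(name="Math Score", breaks=seq(0, 15000, by= 1000)) + # seq() builds the breaks; c(0, 15000, by= 100) would give only 0, 15000 and 100. Step of 1000 chosen for legibility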
theme(axis.text.x  = element_text(angle=45, vjust=0.5)) +
ggtitle("Graph 8")
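
The breaks argument expects a numeric vector with one entry per tick mark. A common slip is to build it with c(), which only combines its arguments, rather than seq(), which generates an evenly spaced sequence. The step of 1000 below is just an illustrative choice.

```r
wrong_breaks <- c(0, 15000, by = 100)     # c() just combines its arguments: 0, 15000, 100
right_breaks <- seq(0, 15000, by = 1000)  # seq() generates 0, 1000, 2000, ..., 15000

length(wrong_breaks)  # number of tick marks c() would produce
length(right_breaks)  # number of tick marks seq() produces
```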