## Left and Right Joins

Learn two more mutating joins, the left and right join, which are mirror images of each other! You'll learn use cases for each type of join as you explore parts and colors of LEGO themes. Then, you'll explore how to join tables to themselves to understand the hierarchy of LEGO themes in the data.

### Left joining two sets by part and color
In the video, you learned how to left join two LEGO sets. Now you'll practice this on two new sets: the Millennium Falcon and the Star Destroyer.

In [2]:
# load data
parts <- readRDS("parts.rds")
part_categories <- readRDS("part_categories.rds")
inventory_parts <- readRDS("inventory_parts.rds")
inventories <- readRDS("inventories.rds")
sets <- readRDS("sets.rds")
colors <- readRDS("colors.rds")

# load dplyr for the join verbs
library(dplyr)

inventory_parts_joined <- inventories %>%  
    inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%  
    select(-id, -version) %>%  
    arrange(desc(quantity))

millennium_falcon <- inventory_parts_joined %>%
  filter(set_num == "7965-1")

star_destroyer <- inventory_parts_joined %>%
  filter(set_num == "75190-1")

# Combine the star_destroyer and millennium_falcon tables
millennium_falcon %>%
    left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))

set_num_falcon,part_num,color_id,quantity_falcon,set_num_star_destroyer,quantity_star_destroyer
7965-1,63868,71,62,,
7965-1,3023,0,60,,
7965-1,3021,72,46,75190-1,6
7965-1,2780,0,37,75190-1,36
7965-1,60478,72,36,,
7965-1,6636,71,34,75190-1,2
7965-1,3009,71,28,75190-1,2
7965-1,3665,71,22,,
7965-1,2412b,72,20,75190-1,11
7965-1,3010,71,19,,
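
To see how the left join fills in missing matches, here is a minimal sketch on toy data. The rows are copied from the output above, but the small `falcon` and `destroyer` data frames are stand-ins built for illustration, not the real tables.

```r
library(dplyr)

# Toy stand-ins for millennium_falcon and star_destroyer (three and two rows)
falcon <- data.frame(
  set_num  = "7965-1",
  part_num = c("63868", "3021", "2780"),
  color_id = c(71, 72, 0),
  quantity = c(62, 46, 37),
  stringsAsFactors = FALSE
)
destroyer <- data.frame(
  set_num  = "75190-1",
  part_num = c("3021", "2780"),
  color_id = c(72, 0),
  quantity = c(6, 36),
  stringsAsFactors = FALSE
)

# Columns that exist in both tables but are not join keys get the suffixes
joined <- falcon %>%
  left_join(destroyer, by = c("part_num", "color_id"),
            suffix = c("_falcon", "_star_destroyer"))

# Every Falcon row is kept; part 63868 has no match, so its
# Star Destroyer columns are NA
joined
```

All rows of the first (left) table survive the join, which is why the unmatched Falcon parts appear with empty Star Destroyer columns.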


### Left joining two sets by color
In the videos and the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone?

In [6]:
# Aggregate Millennium Falcon for the total quantity in each color
millennium_falcon_colors <- millennium_falcon %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

# Aggregate Star Destroyer for the total quantity in each color
star_destroyer_colors <- star_destroyer %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

# Left join the Millennium Falcon colors to the Star Destroyer colors
millennium_falcon_colors %>%
  left_join(star_destroyer_colors, by = c("color_id"), suffix = c("_falcon", "_star_destroyer"))

`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)


color_id,total_quantity_falcon,total_quantity_star_destroyer
0,201,336.0
1,15,23.0
4,17,53.0
14,3,4.0
15,15,17.0
19,95,12.0
28,3,16.0
33,5,
36,1,14.0
41,6,15.0
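
One reason to aggregate before joining by color is that a color usually appears on many parts in each set. The sketch below uses made-up quantities in two small data frames (`falcon` and `destroyer` are illustrative names) to compare joining the raw rows with joining the per-color totals.

```r
library(dplyr)

# Toy data (made up): two black (color 0) parts in each set, plus one other color
falcon    <- data.frame(color_id = c(0, 0, 71), quantity = c(60, 37, 62))
destroyer <- data.frame(color_id = c(0, 0, 72), quantity = c(36, 4, 6))

# Joining raw rows by color pairs every black Falcon part with every black
# Star Destroyer part (2 x 2 rows), plus one unmatched row for color 71.
# Newer dplyr versions may also warn about a many-to-many relationship here.
raw_join <- falcon %>%
  left_join(destroyer, by = "color_id", suffix = c("_falcon", "_star_destroyer"))
nrow(raw_join)

# Aggregating first leaves one row per color on each side
falcon_colors <- falcon %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))
destroyer_colors <- destroyer %>%
  group_by(color_id) %>%
  summarize(total_quantity = sum(quantity))

color_join <- falcon_colors %>%
  left_join(destroyer_colors, by = "color_id",
            suffix = c("_falcon", "_star_destroyer"))
color_join
```

With one row per color on each side, the joined table has exactly one row per Falcon color, and the totals can be compared directly.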


### Finding an observation that doesn't have a match
Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.

For example, the inventories table has a version column that tracks changes or upgrades to a LEGO kit. It would be fair to assume that every set (the sets table joins well with inventories) has at least a version 1. Let's test this assumption in the following exercise.

In [8]:
inventory_version_1 <- inventories %>%
  filter(version == 1)

# Join versions to sets
sets %>%
  left_join(inventory_version_1, by = c("set_num")) %>%
  # Filter for rows where version is NA
  filter(is.na(version))

set_num,name,year,theme_id,id,version
40198-1,Ludo game,2018,598,,
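
As an aside, dplyr's `anti_join()` is a different technique that answers the same question directly. The sketch below compares it with the left join plus filter on toy data: the inventory ids are hypothetical and the tables are cut down to a few rows for illustration.

```r
library(dplyr)

# Toy stand-ins for sets and inventory_version_1 (inventory ids are hypothetical)
sets <- data.frame(
  set_num = c("7965-1", "75190-1", "40198-1"),
  stringsAsFactors = FALSE
)
inventory_version_1 <- data.frame(
  id      = c(101, 102),
  version = c(1, 1),
  set_num = c("7965-1", "75190-1"),
  stringsAsFactors = FALSE
)

# Left join + filter, as in the exercise: unmatched sets have NA in version
via_left_join <- sets %>%
  left_join(inventory_version_1, by = "set_num") %>%
  filter(is.na(version))

# anti_join keeps the rows of sets with no match and adds no columns
via_anti_join <- sets %>%
  anti_join(inventory_version_1, by = "set_num")

via_left_join
via_anti_join
```

Both approaches return the same set; the left join version also shows the empty columns from the second table, which can be handy when inspecting why a match failed.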


### Counting part colors
Sometimes you'll want to do some processing before you do a join, and prioritize keeping the second (right) table's rows instead. In this case, a right join is for you.

In the example below, we'll count the part_cat_id values in parts before using right_join to join with part_categories. We do this because we want to know not only the count of each part_cat_id in parts, but also whether any part categories are not present in parts at all.

In [9]:
parts %>%
    # Count the part_cat_id
    count(part_cat_id) %>%
    # Right join part_categories
    right_join(part_categories, by = c("part_cat_id" = "id"))

part_cat_id,n,name
1,135,Baseplates
3,303,Bricks Sloped
4,1900,"Duplo, Quatro and Primo"
5,107,Bricks Special
6,128,Bricks Wedged
7,97,Containers
8,24,Technic Bricks
9,167,Plates Special
11,490,Bricks
12,85,Technic Connectors


In [10]:
parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Filter for NA
    filter(is.na(n))

part_cat_id,n,name
66,,Modulex
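
Because left and right joins are mirror images, the right join above can also be written as a left join with the tables swapped. The sketch below checks this on toy data: the counts come from the output above, but the `counts` and `categories` data frames are small stand-ins.

```r
library(dplyr)

# Toy stand-ins: counts per category, and the category lookup table
counts <- data.frame(part_cat_id = c(1, 3), n = c(135L, 303L))
categories <- data.frame(
  id   = c(1, 3, 66),
  name = c("Baseplates", "Bricks Sloped", "Modulex"),
  stringsAsFactors = FALSE
)

# Right join: keep every row of categories (the second table)
right <- counts %>%
  right_join(categories, by = c("part_cat_id" = "id")) %>%
  arrange(part_cat_id)

# Mirror image: left join with categories as the first table
left <- categories %>%
  left_join(counts, by = c("id" = "part_cat_id")) %>%
  arrange(id)

right
left
```

The two results hold the same rows and values. Only the column order and the name of the key column differ, since the key takes its name from the first table.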


### Cleaning up your count
In both left and right joins, the resulting table can contain NA values. Fortunately, the replace_na function can turn those NAs into meaningful values.

In the last exercise, we saw that the n column had NAs after the right_join. Let's use the replace_na function, which takes a list of column names and the values with which NAs should be replaced, to clean up our table.

In [12]:
# library to replace_na
library(tidyr)

parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Use replace_na to replace missing values in the n column
    replace_na(list(n = 0))

"package 'tidyr' was built under R version 3.6.3"

part_cat_id,n,name
1,135,Baseplates
3,303,Bricks Sloped
4,1900,"Duplo, Quatro and Primo"
5,107,Bricks Special
6,128,Bricks Wedged
7,97,Containers
8,24,Technic Bricks
9,167,Plates Special
11,490,Bricks
12,85,Technic Connectors
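
The first ten rows shown above happen to contain no NAs, so the effect of replace_na is easier to see on a small table that includes the Modulex row. This is a minimal sketch on a toy `counts` data frame; the `coalesce()` variant is an alternative from dplyr, not part of the exercise.

```r
library(dplyr)
library(tidyr)

# Toy result of the right join, including the category with no parts
counts <- data.frame(
  part_cat_id = c(1, 3, 66),
  n           = c(135L, 303L, NA),
  name        = c("Baseplates", "Bricks Sloped", "Modulex"),
  stringsAsFactors = FALSE
)

# replace_na takes a named list: column name = replacement value
cleaned <- counts %>%
  replace_na(list(n = 0L))

# Alternative: coalesce() returns the first non-missing value per element
cleaned_alt <- counts %>%
  mutate(n = coalesce(n, 0L))

cleaned
```

Columns that are not named in the list are left untouched, so only n changes here.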


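### Joining themes to their children
Each row of the themes table has a parent_id that points to another row of the same table, so joining themes to itself pairs each parent theme with its children.

In [13]: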
# load data
themes <- readRDS("themes.rds")

themes %>% 
    # Inner join the themes table
    inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
    # Filter for the "Harry Potter" parent name 
    filter(name_parent == "Harry Potter")

id,name_parent,parent_id,id_child,name_child
246,Harry Potter,,247,Chamber of Secrets
246,Harry Potter,,248,Goblet of Fire
246,Harry Potter,,249,Order of the Phoenix
246,Harry Potter,,250,Prisoner of Azkaban
246,Harry Potter,,251,Sorcerer's Stone
246,Harry Potter,,667,Fantastic Beasts
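
The self-join can be tried without the data file. The sketch below builds a five-row toy `themes` data frame from theme ids and names that appear in the outputs of this section; it stands in for the real table.

```r
library(dplyr)

# Toy themes table: top-level themes have NA as parent_id
themes <- data.frame(
  id        = c(1, 5, 6, 246, 247),
  name      = c("Technic", "Model", "Airport", "Harry Potter", "Chamber of Secrets"),
  parent_id = c(NA, 1, 5, NA, 246),
  stringsAsFactors = FALSE
)

# Match each theme's id (first copy) to the parent_id of its children (second copy)
parent_child <- themes %>%
  inner_join(themes, by = c("id" = "parent_id"),
             suffix = c("_parent", "_child"))

parent_child %>%
  filter(name_parent == "Harry Potter")
```

The suffix argument is what keeps the two copies apart: name from the first copy becomes name_parent, and id and name from the second copy become id_child and name_child.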


### Joining themes to their grandchildren
We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children's children.

Here, we can inner join themes to itself a second time to connect the children from our last join to their own children.

In [15]:
# Join themes to itself again to find the grandchild relationships
themes %>% 
  inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))

id_parent,name_parent,parent_id,id_child,name_child,id_grandchild,name
1,Technic,,5,Model,6,Airport
1,Technic,,5,Model,7,Construction
1,Technic,,5,Model,8,Farm
1,Technic,,5,Model,9,Fire
1,Technic,,5,Model,10,Harbor
1,Technic,,5,Model,11,Off-Road
1,Technic,,5,Model,12,Race
1,Technic,,5,Model,13,Riding Cycle
1,Technic,,5,Model,14,Robot
1,Technic,,5,Model,15,Traffic
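
In the output above, the grandchild's name lands in a column called plain name: the suffix is only applied to columns whose names clash, and after the first join no column is called name any more. A minimal sketch of one way to tidy this with `rename()`, on a toy `themes` data frame built from ids and names seen in this section:

```r
library(dplyr)

# Toy themes table with three levels: Technic > Model > Airport
themes <- data.frame(
  id        = c(1, 5, 6, 246, 247),
  name      = c("Technic", "Model", "Airport", "Harry Potter", "Chamber of Secrets"),
  parent_id = c(NA, 1, 5, NA, 246),
  stringsAsFactors = FALSE
)

grandchildren <- themes %>%
  inner_join(themes, by = c("id" = "parent_id"),
             suffix = c("_parent", "_child")) %>%
  inner_join(themes, by = c("id_child" = "parent_id"),
             suffix = c("_parent", "_grandchild")) %>%
  # name had no clash in the second join, so label it explicitly
  rename(name_grandchild = name)

grandchildren
```

Only themes with both a child and a grandchild survive the two inner joins, so Harry Potter drops out of this toy result.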


### Left-joining a table to itself
So far, you've been inner joining a table to itself in order to find the children of themes like "Harry Potter" or "The Lord of the Rings".

But some themes might not have any children at all, which means they won't be included in the inner join. As you've learned in this chapter, you can identify those with a left_join and a filter().

In [16]:
themes %>% 
  # Left join the themes table to its own children
  left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
  # Filter for themes that have no child themes
  filter(is.na(name_child))

id,name_parent,parent_id,id_child,name_child
2,Arctic Technic,1,,
3,Competition,1,,
4,Expert Builder,1,,
6,Airport,5,,
7,Construction,5,,
8,Farm,5,,
9,Fire,5,,
10,Harbor,5,,
11,Off-Road,5,,
12,Race,5,,
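
The same left self-join can also count children per theme, including themes with zero children that an inner join would drop. This is a sketch on a toy `themes` data frame built from ids and names seen in this section; `n_children` is an illustrative column name, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy themes table: Airport and Chamber of Secrets have no children
themes <- data.frame(
  id        = c(1, 5, 6, 246, 247),
  name      = c("Technic", "Model", "Airport", "Harry Potter", "Chamber of Secrets"),
  parent_id = c(NA, 1, 5, NA, 246),
  stringsAsFactors = FALSE
)

child_counts <- themes %>%
  left_join(themes, by = c("id" = "parent_id"),
            suffix = c("_parent", "_child")) %>%
  group_by(id, name_parent) %>%
  # Unmatched themes have NA in id_child, so they contribute a count of 0
  summarize(n_children = sum(!is.na(id_child)), .groups = "drop")

# Themes without children
child_counts %>%
  filter(n_children == 0)
```

Counting non-missing id_child values, rather than counting rows, is what keeps childless themes at zero instead of one.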
